
Author Archive

Using Latent Dirichlet Allocation to Brainstorm New Content


I recently ran into a problem with my client: I ran out of things to write about. The client, a chimney sweep, has been with our company for three years, and in that time we have written every article under the sun informing people about chimneys, the issues they cause, potential hazards, and optimal solutions. All of that writing has worked, and worked well: we have seen year-over-year traffic increases of over 100%. The challenge now is to keep that momentum.

Brainstorming sessions weren’t working. They looked more like a list of accomplishments than of new ideas. Each new idea seemed like we were slightly changing an already successful article written in the past. I wanted something new and I wanted to make sure it was tied to a strategy. Tell me if this sounds familiar!

So I internalized the problem. I let it smolder and waited for the answer. Then, while reflecting on the effects of website architecture and content consolidation, topic modeling popped into my head. If I could scrape the content we've already written and throw it into a Latent Dirichlet Allocation (LDA) model, I could let the algorithm do the brainstorming for me.

For those of you unfamiliar with Latent Dirichlet Allocation, it is:

“a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics.” -Wikipedia

All that basically says is: a website has a lot of articles, each of those articles relates to a topic of some sort, and by using LDA we can programmatically determine what the main topics of a website are. (If you want to see a great visualization of LDA at work on 100,000 Wikipedia articles, check this out.)

So, by applying LDA to our previously written articles, we can hopefully find areas to write about that will help my client be seen as more authoritative in certain topics.

So I got to researching. The two tools I found which allowed me to quickly test this idea were a content scraper by Kimono and a Topic Modeling Tool I found on code.google.com.

Scrape Content With Kimono

Kimono has an easy-to-use web application that uses a Chrome extension to train the scraper to pull certain types of data from a page. You are then able to give Kimono a list of URLs that have similar content and have it return a CSV of all the information you need.

Training Kimono is easy; data selection works similarly to the magnifying-glass feature of many web dev tools. For my purposes I was only interested in the header tag text and body content. (Kimono does much more than this; I recommend you check them out.) Kimono's video about extracting data will give you a better idea of how easy this is. When it's done, Kimono gives you a CSV file you can use in the topic modeling tool.
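If you would rather script the scrape yourself, here is a rough Python sketch using requests and BeautifulSoup in place of Kimono. The h1 and article selectors are assumptions; inspect your own blog's markup and adjust them before running.

```python
# Do-it-yourself alternative to Kimono: fetch each article URL and pull out
# the header and body text. The "h1" and "article" selectors are assumptions,
# not a universal rule; adjust them to match your site's markup.
import csv

import requests
from bs4 import BeautifulSoup


def parse_article(html):
    """Extract (header, body) text from one article's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("h1")
    body = soup.find("article")
    return (
        header.get_text(strip=True) if header else "",
        body.get_text(" ", strip=True) if body else "",
    )


def scrape_to_csv(urls, out_path="articles.csv"):
    """Scrape every URL and write a CSV ready for the topic modeling tool."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "header", "body"])
        for url in urls:
            resp = requests.get(url, timeout=10)
            writer.writerow([url, *parse_article(resp.text)])
```

Splitting the parsing out of the fetch also makes it easy to test the selectors against a saved page before crawling the whole blog.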

Compile a List of URLs with Screaming Frog

Next I needed a list of URLs for Kimono to scrape. Screaming Frog was the easy solution for this. I had Screaming Frog pull a list of articles from the client's blog, then I plugged those into Kimono. You could also use the page path report from Google Analytics.
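The Screaming Frog step can also be scripted. This sketch assumes the default internal HTML export with an "Address" column, and that articles live under a /blog/ path; adjust both for your crawl.

```python
# Filter a Screaming Frog export down to blog article URLs for the scraper.
# Assumes the default "internal_html.csv" export with an "Address" column,
# and that articles live under /blog/; both are assumptions to adjust.
import csv


def blog_urls(export_path, path_fragment="/blog/"):
    """Return the crawled URLs whose path contains the given fragment."""
    with open(export_path, newline="", encoding="utf-8") as f:
        return [
            row["Address"]
            for row in csv.DictReader(f)
            if path_fragment in row["Address"]
        ]
```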


Map Topics With This GUI Topic Modeling Tool

Many of the topic modeling tools out there require some coding knowledge. However, I was able to find this Topic Modeling Tool housed on code.google.com. The development of this program was funded by the Institute of Museum and Library Services to Yale University, the University of Michigan, and the University of California, Irvine.

The institute’s mission is to create strong libraries and museums that connect people to information and ideas. My mission is to understand how strong my client’s content library is and how I can connect them with more people. Perfect match.

Download the program, then:
1. Upload the CSV file from Kimono into the ‘Select Input File or Dir’ field.
2. Select your output directory.
3. Pick the number of topics you would like it to produce. 10-20 should be fine.
4. If you’re feeling like a badass you can change the advanced settings. More on that below.
5. Click Learn Topics.

Main Topic Modeling Interface
Advanced Settings Interface


Advanced Options
Besides the basic options provided in the first window, there are more advanced parameters that can be set by clicking the Advanced button.

Remove stopwords – If checked, remove a list of “stop words” from the text.

Stopword file – Read “stop words” from a file, one per line. Default is Mallet’s list of standard English stopwords.

Preserve case – If checked, do not force all strings to lowercase.

No. of iterations – The number of iterations of Gibbs sampling to run.
Default is:
– For T ≤ 500, default iterations = 1000
– Else, default iterations = 2*T
Suggestion: Feel free to use the default setting for number of iterations. If you run for more iterations, the topic coherence *may* improve.

No. of topic words printed – The number of most probable words to print for each topic after model estimation. Default is print top-10 words. Typical range is top-10 to top-20 words.

Topic proportion threshold – Do not print topics with proportions less than this threshold value. Good suggested value is 5%. You may want to increase this threshold for shorter documents.

Analyze The Output

The raw output is a list of keywords organized into rows, each row representing a topic. To make analysis easier, I transposed these rows into columns. Then I put my marketer hat on and manually highlighted every word in these topics that directly related to services, products, or the industry. That looks something like this:


Once I identified the keywords that most closely related to the client’s industry and offering, I eyeballed several themes that these keywords could fall under. I found themes related to Repair, Fire, Safety, Building, Home, Environmental, and Cleaning.

Once I had this list, I looked back through each topic column and added the themes I felt best matched the words above each LDA topic. That gave me a range at the top of my LDA topics which I could sum using a COUNTIF function in Excel. The result is something to the right.
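The COUNTIF step can also be sketched in a few lines of Python. The theme assignments below are invented for illustration; yours come from eyeballing the topic words.

```python
# The COUNTIF step in code: tally how often each manually assigned theme
# appears across the LDA topics. These assignments are made up to illustrate
# the shape of the analysis, not taken from the client's real data.
from collections import Counter

topic_themes = [
    ["Repair", "Safety"],        # themes tagged to LDA topic 1
    ["Fire", "Safety", "Home"],  # topic 2
    ["Building", "Home"],        # topic 3
    ["Fire", "Environmental"],   # topic 4
]

theme_counts = Counter(t for themes in topic_themes for t in themes)
print(theme_counts.most_common())
# Themes that rarely or never appear (Cleaning, in this made-up example)
# are the content gaps worth writing toward.
```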

Obviously this last part is far from scientific. The only thing remotely scientific about it is using Latent Dirichlet Allocation to organize words into topics. However, it does provide value. This is a real model rooted in math; I used actual blog content, not a list of keywords that came from a brainstorming session and Ubersuggest, and with a little intuition I got an idea of the strengths and weaknesses of my client’s blog content.

Cleaning is a very important part of what my client does, yet it does not have much of a presence in this analysis. I have my next blog topic!

Something To Consider

LDA and topic modeling have been around for 11 years now, and most search-related articles about the topic appeared between 2010 and 2012. I am unsure why that is, as all of my effort so far has gone toward testing the model. Moving forward I will be digging a little deeper to make sure this is something worth pursuing. If it is, you can expect me to report on a more scientific application, along with results, in the future.

Getting Started with ShufflePoint



ShufflePoint is a paid application that uses Excel’s built-in “Web Query” function to pull data from Google Analytics into Excel. It is an extremely powerful tool and allows you to take advantage of Excel’s data manipulation abilities. This gives you the freedom to develop compelling visuals that will help you quickly assess the performance of a website. When I was developing my first ShufflePoint report, I found that thinking about and planning data organization took the most time. My hope is that this article will help you graph your Google Analytics data in Excel with as little trial and error as possible.

Technical and Webmaster Guidelines for HTTPS


Spurred on by the Edward Snowden revelations, Google has begun taking security more seriously. After the revelations came out, Google quickly secured and patched its own weaknesses. Now they are pushing to encrypt all internet activity by giving websites that use SSL certificates a boost in rankings.

During a Google I/O presentation this year called HTTPS Everywhere, speakers Ilya Grigorik and Pierre Far made it clear that this move is not just about encrypting the data being passed between server and browser, but also about protecting users from having the metadata surrounding those requests collected.

Though the metadata collected by visiting a single unencrypted website is benign, when you aggregate that data it can pose a serious security risk to the user. Thus, by incentivizing HTTPS, Google has begun to eliminate instances on the web where users could unknowingly have information collected about them.

I will give you the SparkNotes version of the HTTPS Everywhere presentation, but even that will warrant a TL;DR stamp. My hope is that this outline and the resource links contained within it give you a hub you can use when evaluating and implementing HTTPS on your site. (more…)

MozCon 2014 Recap: What I Learned


A holistic industry transformation was the tone at MozCon this year and Erica McGillivray and team did a fantastic job getting speakers that supported this theme. Those chosen for the conference are experts in their fields, pushing conventional wisdom and challenging us with new ways to tackle old problems. Each spoke on different topics, but to the same point.

MozCon started with a presentation from our fearless SEO leader, the Wizard of Moz himself, Rand Fishkin. Rand started off the conference by reflecting on the past year in search and framing his vision for the future. He highlighted 5 big trends from the past year.


11 People to Watch at MozCon 2014

MozCon is a three-day marketing conference put on by Moz.com. The conference brings together next-level speakers to talk about everything from SEO to brand development to analytics. This year Erica McGillivray and team will bring 29 speakers to the Emerald City to give their expert opinions on the future of marketing. It is a jam-packed three days, so I have outlined the eleven people I am most excited to see, along with some of their own reasons you should watch them.

When: July 14-16, 2014
Where: Seattle, Washington


Free Excel Workbook for Analyzing Screaming Frog Data

Workbook updated on 10/29/14 with the following features:
- A cleaner style that makes reading the dashboard easier
- A new area in the workbook for outlining the top 5 takeaways from the data
- Better, consolidated visualizations make spotting issues faster
- Space added to insert client logo

We here at LunaMetrics are born from data, and to data we return time and time again to uncover insights and craft strategy. But staring at large sets of data is a mind-numbing process, one I personally hate. So when I began performing health checks for large websites, I immediately started thinking about how I could eliminate as much work as possible. Using some Excel magic, many Mr. Excel videos, and data pulled from Screaming Frog, I created a simple copy-and-paste workbook that counts, totals, and visualizes all the data Screaming Frog gives you.

Shout out to Dan Sharp of Screaming Frog for his great feedback on this workbook. Keep an eye out for Screaming Frog’s new version being released in the next couple of weeks. The big addition? Data visualization. Can’t wait for that!

Free Excel Screaming Frog Analysis Spreadsheet


Webmaster Tools Site Versions: It Just Works.

Google recently announced that they added more precise data in Webmaster Tools. The announcement highlighted the ability to track secure (HTTPS) versions of a site and also cast a light on a webmaster’s ability to track subdomains and subdirectories.

In their documentation Google provided examples of the types of URLs you can add.

• http://example.com/
• https://example.com/
• ftp://ftp.example.com/
• http://bar.example.com/
• http://foo.bar.example.com/
• http://www.example.com/foo/
• http://www.example.com/foo/bar/
• http://foo.bar.example.com/catalog/dresses/


Implementing Google Authorship

What is rel=”author”?

Part of the HTML5 spec, rel=”author” can be added to any <link>, <a> or <area> tag to inform search engines that the other end of the author link represents the author of the content being crawled.
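In markup, either form looks something like this (the URLs and the author name are placeholders, not real profiles):

```html
<!-- In the document head, pointing at an author profile page: -->
<link rel="author" href="https://plus.google.com/1234567890" />

<!-- Or inline, in the article byline: -->
<p>Written by <a rel="author" href="https://example.com/about/jane-doe">Jane Doe</a></p>
```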

In 2011 Google began using rel=”author” in an attempt to understand authorship of content more broadly. There has been some turbulence in the SEO community over whether Google will actually use this to rank content in the future. But Google’s Matt Cutts has most recently stated that Google is using rel=”author” as part of an Author Rank when serving in-depth articles in their search results. Thus it is important to ensure you have this set up properly on your website. (more…)

Resources, On Resources, On Design Resources

Hmm…I wonder what this post is about.

You don’t spend endless hours online without coming across some great websites that the average person wouldn’t find. I have folders and folders of these sites and thought offloading some of them into a blog post would be useful. The following resources focus on design: free stock image sites, free vector graphic sites, color generators, pattern generators, and a couple of great free font and data visualization websites.

2014 Web Design Trends: All Simple Everything

Focused Design


“A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away.” –Antoine de Saint-Exupéry

Keep it simple, stupid. Moderation is the key to a good life and to good 2014 UX. Flat design is about removing all the unnecessary flair and letting what you have to say be the main focus.

In the past there has been a gap between the world designers live in (Photoshop) and the one developers live in (text editors). Browsers are getting better at what they can do from a color and design (SVG) perspective, which used to bottleneck the creativity of designers. This gap will begin to close as the adoption of flat design gives designers more incentive, and more necessity, to design within the browser.

Because of this browser-focused design, the capabilities of the developer will be paramount to the execution of the designer’s vision. The developer’s ability not only to code the design, but to understand how that design will translate across browsers and devices, will make finding the right designer-developer team as important as either one’s individual capabilities. Will we see a developer-designer team crowned as the king of 2014 web design?