Generating Keyword Clusters – The Art and Science of Analysis, Part 3


In parts one and two, I wrote about using both sides of your brain to do analysis and walked through a simple example of analysis. Now I’d like to turn to something complex, or at least with the potential for complexity: keyword analysis.


Keywords can be a rich source of visitor intent. I’m talking about search queries that lead to visits, as well as terms entered in site search after visitors arrive.

But looking at only the top 100 or even the top 1,000 keywords (ranked by your favorite metric: bounce rate, conversion rate, or whatever you like) won’t necessarily give you an accurate analysis, because it ignores the information in the long tail, which may run to tens of thousands of keywords or more.

If you’ve spent any time examining keyword data, you’ve observed similar terms dispersed throughout the long tail. I want to group those terms and analyze each group’s aggregated data to give a more complete picture. So what’s the best way to do that?

The Answer: Keyword Clusters

Of course, I’m not the first person to propose that analyzing groups or clusters of keywords can lead to more valuable insight than analyzing individual keywords alone.

In January, AJ Kohn wrote about his method for creating keyword rank indexes by exporting a CSV file with keyword rank history and leveraging pivot tables in Excel. Together with Justin Cutroni, he describes how to use event tracking to put keyword rank data into Google Analytics.

Another article from a couple years ago describes clustering keywords by their performance on various metrics. The author mentions (but doesn’t go into detail about) using tools like SPSS or SAS to do the cluster analysis and come up with related groups of terms.

And recently SEOmoz published an article about tracking SEO ‘broad match’ keywords in Google Analytics. Author Tracy Mu creates keyword clusters using regular expressions and then saves them as advanced segments. She then applies four segments at a time to custom reports and, really smartly, saves those reports as GA shortcuts.
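
To make the regular-expression approach concrete, here is a minimal Python sketch. It is not the setup from the SEOmoz article, which works inside Google Analytics advanced segments; it only shows the same idea applied to an exported keyword list. The cluster names, the patterns, and the sample rows are all made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical cluster definitions: cluster name -> regular expression.
CLUSTER_PATTERNS = {
    "analytics": re.compile(r"\banalytics?\b", re.IGNORECASE),
    "seo": re.compile(r"\bseo\b|search engine optimi[sz]ation", re.IGNORECASE),
}


def aggregate_by_cluster(rows, patterns=CLUSTER_PATTERNS):
    """Sum visits and conversions per cluster.

    rows is an iterable of (keyword, visits, conversions) tuples.
    A keyword that matches several patterns is counted in each of them;
    a keyword that matches none goes into the 'other' bucket.
    """
    totals = defaultdict(lambda: {"visits": 0, "conversions": 0})
    for keyword, visits, conversions in rows:
        matched = [name for name, pat in patterns.items() if pat.search(keyword)]
        for name in matched or ["other"]:
            totals[name]["visits"] += visits
            totals[name]["conversions"] += conversions
    return dict(totals)


# Made-up sample data standing in for an exported keyword report.
sample_rows = [
    ("google analytics training", 120, 6),
    ("seo consultant", 40, 2),
    ("search engine optimisation course", 10, 1),
    ("web design", 5, 0),
]

cluster_totals = aggregate_by_cluster(sample_rows)
for name, metrics in sorted(cluster_totals.items()):
    print(name, metrics)
```

The weakness is the one this post is concerned with: someone still has to decide up front which patterns to write.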

The Next Question: Linguistic Complexity

All of those techniques are interesting and useful, but not quite what I’m looking for. The first two methods group keywords by a non-linguistic feature such as rank or performance. What keywords are in those groups? Still individual keywords dispersed across a slightly shorter long tail.

The last method, borrowing the idea of broad match from paid search, does what I want but with a limited number of clusters. The other drawback (for me) is that I don’t want to guess which keyword clusters to create. That’s a little too much art and not enough science.

What I really want to do is apply text analytics methods to discover patterns in keyword data, related to the semantic domain of the customer, and create related keyword groups automatically. This would account for linguistic complexity in all the forms actually produced by site visitors, a seemingly endless variety of word choices, phrasing, and spelling.

Text Analytics to the Rescue

I found someone else looking for the same thing in a question on Stack Overflow about using Python to cluster search engine keywords. The tricky part, as suggested in the question, is developing a domain-specific word source rather than relying on a more generally-informed source like WordNet.

One way to develop a customized word source, or “topic library”, is to mine web content related to the customer’s industry, and then cross-reference it with a database of phrases (such as the customer’s actual keywords). This allows for identification of phrases that will be treated as one word, as well as proper nouns and acronyms that may be specific to the customer’s products or services.
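
One possible reading of that cross-referencing step, sketched in Python under some loud assumptions: the "topic library" is reduced to two-word phrases, a phrase qualifies if it appears in at least one customer keyword and at least `min_count` times in the industry content, and the documents and keywords below are invented. The function names are hypothetical, and proper nouns and acronyms are not handled at all.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigrams(tokens):
    """Return adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def build_phrase_library(industry_docs, keywords, min_count=2):
    """Keep bigrams that occur in a keyword AND at least min_count times in the content."""
    content_counts = Counter()
    for doc in industry_docs:
        content_counts.update(bigrams(tokenize(doc)))
    keyword_bigrams = set()
    for keyword in keywords:
        keyword_bigrams.update(bigrams(tokenize(keyword)))
    return {bg for bg in keyword_bigrams if content_counts[bg] >= min_count}


def merge_phrases(tokens, phrases):
    """Join adjacent tokens found in the phrase library so they act as one word."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Invented stand-ins for mined industry content and the customer's keywords.
industry_docs = [
    "Event tracking lets you record clicks. Set up event tracking in minutes.",
    "Use advanced segments to compare visitors. Advanced segments can be saved.",
]
customer_keywords = [
    "event tracking tutorial",
    "how to use advanced segments",
    "tracking tutorial",
]

phrase_library = build_phrase_library(industry_docs, customer_keywords)
merged_example = merge_phrases(tokenize("how to use advanced segments"), phrase_library)
print(sorted(phrase_library))
print(merged_example)
```

With this sample data, "event tracking" and "advanced segments" qualify as phrases, while "use advanced" does not because it appears only once in the content.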

I’m planning to combine the customized topic library with a tool like the Python Natural Language Toolkit to create keyword clusters for better analysis. I’ll keep you updated on the results.
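
Until those results are in, here is a rough idea of what automatic grouping could look like. This is a standard-library-only Python sketch, not the planned NLTK pipeline: it measures the Jaccard distance between the word sets of two keywords and puts keywords in the same cluster when a chain of close-enough pairs connects them (single linkage). The 0.5 threshold and the sample keywords are arbitrary assumptions, and a real version would probably add stemming, spelling normalization, and the phrase handling described above.

```python
import re
from itertools import combinations


def token_set(keyword):
    """Represent a keyword as the set of its lowercase alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", keyword.lower()))


def jaccard_distance(a, b):
    """1 minus (shared tokens / all tokens); 0.0 means identical sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_keywords(keywords, max_distance=0.5):
    """Single-linkage grouping: link every pair within max_distance."""
    parent = list(range(len(keywords)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [token_set(k) for k in keywords]
    for i, j in combinations(range(len(keywords)), 2):
        if jaccard_distance(sets[i], sets[j]) <= max_distance:
            parent[find(j)] = find(i)

    groups = {}
    for i, keyword in enumerate(keywords):
        groups.setdefault(find(i), []).append(keyword)
    return list(groups.values())


# Invented sample keywords.
sample_keywords = [
    "google analytics training",
    "analytics training pittsburgh",
    "google analytics training course",
    "seo consultant rates",
    "seo consultant pricing",
]

clusters = cluster_keywords(sample_keywords)
for group in clusters:
    print(group)
```

Comparing every pair of keywords is quadratic in the number of keywords, so a long tail of tens of thousands of terms would need a smarter strategy than this brute-force loop.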

What tools and techniques have you used for grouping keywords for analysis? Please share in the comments.

Dorcas Alexander is a Manager for the Analytics & Insight department. Her path to LunaMetrics followed stints in ad agency creative, math, and computer science. Dorcas has a master's degree in language and information technologies from Carnegie Mellon University, where she helped build precursors to a Universal Translator. One of the top-rated tournament Scrabble players in Pennsylvania, Dorcas has an insatiable drive to compete and win.

  • Sean

    About 2 years ago I scraped the top 100 results off of Google for roughly 30K keywords and built a matrix of keyword x website. Threw that into R’s hclust function, and about 3 days later I had a dendrogram of what Google thought of the keywords.

    It wasn’t too bad, but a lot of work still needs to be done on figuring out the “distance” between two keywords, since I was just learning this stuff as I went. But I figure Google knows which keywords are related, and if it shows similar results for two keywords, that must be an indicator, no?

    • Dorcas Alexander

      Hi Sean, I thought I might end up using R eventually, but wanted to give the Python Toolkit a try first. And yes, figuring the distance between keywords is another potentially time-intensive part of the process, and probably requires some trial-and-error experimentation. I would also guess that Google showing similar results for two keywords is an indicator of shorter distance. Thanks for your comments!

  • Dorcas, Great to see a post on this.
    Since forever I have been looking for a way to group keywords semantically. Long tail is great, but I have never been able to find a way to group themes in, say, 10k to 20k long-tail keywords.
    Thanks for a nice roundup of info. Having tried all kinds of tools, one that I have found (a bit useful) that uses textual analysis is Open Calais.
    I await your update with interest…

    • Dorcas Alexander

      Thanks for your comments, Seamus. I took a quick look at Open Calais and it seems like something I might want to try. With all the tools I try, it’s a matter of finding the right balance between letting the tool do the work and having more control (meaning having to do more work myself).

  • Raymond

    I just came across this post. Is there any update on your keyword analysis? Thanks
