Generating Keyword Clusters – The Art and Science of Analysis, Part 3
April 4, 2013
In parts one and two, I wrote about using both sides of your brain to do analysis and walked through a simple example of analysis. Now I’d like to turn to something complex, or at least with the potential for complexity: keyword analysis.
Keywords can be a rich source of visitor intent. I’m talking about search queries that lead to visits, as well as terms entered in site search after visitors arrive.
But looking at only the top 100 or even top 1,000 keywords (ranked by your favorite metric: bounce rate, conversion rate, or whatever you like) won’t necessarily give you the most accurate analysis. It neglects the information in the long tail, which may run to tens of thousands of keywords or more.
If you’ve spent any time examining keyword data, you’ve observed similar terms dispersed throughout the long tail. I want to group those terms and analyze each group’s aggregated data to give a more complete picture. So what’s the best way to do that?
The Answer: Keyword Clusters
Of course, I’m not the first person to propose that analyzing groups or clusters of keywords can lead to more valuable insight than analyzing individual keywords alone.
In January, AJ Kohn wrote about his method for creating keyword rank indexes by exporting a CSV file with keyword rank history and leveraging pivot tables in Excel. Together with Justin Cutroni, he also described how to use event tracking to put keyword rank data into Google Analytics.
Another article from a couple of years ago describes clustering keywords by their performance on various metrics. The author mentions (but doesn’t go into detail about) using tools like SPSS or SAS to do the cluster analysis and come up with related groups of terms.
And recently SEOmoz published an article about tracking SEO ‘broad match’ keywords in Google Analytics. Author Tracy Mu creates keyword clusters using regular expressions and then saves them as advanced segments. She then applies four segments at a time to custom reports and, really smartly, saves those reports as GA shortcuts.
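To make the regular-expression approach concrete outside of Google Analytics, here is a minimal Python sketch. It is not the method from the SEOmoz article, just an illustration of the same idea: each cluster is a regular expression, and every keyword's metrics are added to the first cluster it matches. The keyword rows, cluster names, and patterns are all made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical keyword rows: (search query, visits, conversions).
rows = [
    ("red running shoes", 120, 6),
    ("running shoe sale", 80, 5),
    ("trail runing shoes", 15, 1),   # misspelling from the long tail
    ("winter jackets", 60, 2),
    ("mens jacket waterproof", 25, 1),
    ("gift cards", 10, 0),
]

# Hypothetical cluster definitions: cluster name -> regular expression.
patterns = {
    "running shoes": re.compile(r"\brun+ing\b.*\bshoes?\b"),
    "jackets": re.compile(r"\bjackets?\b"),
}

def aggregate_by_pattern(rows, patterns):
    """Sum metrics per cluster; keywords matching no pattern go to '(other)'."""
    totals = defaultdict(lambda: {"keywords": 0, "visits": 0, "conversions": 0})
    for keyword, visits, conversions in rows:
        # The first matching pattern wins.
        cluster = next(
            (name for name, rx in patterns.items() if rx.search(keyword)),
            "(other)",
        )
        totals[cluster]["keywords"] += 1
        totals[cluster]["visits"] += visits
        totals[cluster]["conversions"] += conversions
    return dict(totals)

clusters = aggregate_by_pattern(rows, patterns)
for name, t in clusters.items():
    rate = t["conversions"] / t["visits"]
    print(f"{name}: {t['keywords']} keywords, {t['visits']} visits, {rate:.1%} conversion")
```

Notice that someone still has to write every pattern by hand, including the one that tolerates the "runing" misspelling.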
The Next Question: Linguistic Complexity
All of those techniques are interesting and useful, but not quite what I’m looking for. The first two methods group keywords by a non-linguistic feature such as rank or performance. What keywords are in those groups? Still individual keywords dispersed across a slightly shorter long tail.
The last method, borrowing the idea of broad match from paid search, does what I want but with a limited number of clusters. The other drawback (for me) is that I don’t want to guess which keyword clusters to create. That’s a little too much art and not enough science.
What I really want to do is apply text analytics methods to discover patterns in keyword data that reflect the customer’s semantic domain, and then create related keyword groups automatically. This would account for linguistic complexity in all the forms actually produced by site visitors: a seemingly endless variety of word choices, phrasing, and spelling.
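As a rough illustration of what "automatically" could mean, here is a small Python sketch that groups keywords without any hand-written patterns. It is a deliberately simple stand-in, not real semantic analysis: it only handles differences in word order and spelling, by sorting the words in each keyword and comparing the result with `difflib.SequenceMatcher` from the standard library. The sample keywords and the 0.8 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

def normalize(keyword):
    """Lowercase and sort the words so that word order is ignored."""
    return " ".join(sorted(keyword.lower().split()))

def cluster_keywords(keywords, threshold=0.8):
    """Greedy clustering: a keyword joins the first cluster whose
    representative (its first member, normalized) is similar enough,
    otherwise it starts a new cluster."""
    clusters = []  # list of (normalized representative, members)
    for kw in keywords:
        norm = normalize(kw)
        for rep, members in clusters:
            if SequenceMatcher(None, rep, norm).ratio() >= threshold:
                members.append(kw)
                break
        else:
            clusters.append((norm, [kw]))
    return [members for _, members in clusters]

# Hypothetical long-tail keywords.
keywords = [
    "running shoes",
    "runing shoes",    # spelling variant
    "shoes running",   # word-order variant
    "winter jacket",
    "winter jackets",
]

groups = cluster_keywords(keywords)
for group in groups:
    print(group)
```

This catches "runing shoes" and "shoes running", but it has no idea that "sneakers" and "running shoes" might belong together. That gap is exactly where a domain-specific word source comes in.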
Text Analytics to the Rescue
I found someone else looking for the same thing in a question on Stack Overflow about using Python to cluster search engine keywords. The tricky part, as suggested in the question, is developing a domain-specific word source rather than relying on a more generally informed source like WordNet.
One way to develop a customized word source, or “topic library”, is to mine web content related to the customer’s industry, and then cross-reference it with a database of phrases (such as the customer’s actual keywords). This allows for identification of phrases that will be treated as one word, as well as proper nouns and acronyms that may be specific to the customer’s products or services.
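Here is a minimal sketch of that cross-referencing step, under some loud assumptions: the "mined web content" is two made-up sentences, phrases are limited to two words, and a phrase qualifies if it appears at least twice in the content and at least once in the keyword list. A real topic library would need far more text and a better phrase-scoring method; the function names below are hypothetical.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; punctuation (including hyphens) is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bigrams(words):
    return list(zip(words, words[1:]))

def build_phrase_library(industry_docs, keywords, min_count=2):
    """Keep two-word phrases that recur in the industry content AND
    also occur in the customer's actual keywords."""
    corpus_counts = Counter()
    for doc in industry_docs:
        corpus_counts.update(bigrams(tokens(doc)))
    keyword_bigrams = set()
    for kw in keywords:
        keyword_bigrams.update(bigrams(tokens(kw)))
    return {
        " ".join(bg)
        for bg, n in corpus_counts.items()
        if n >= min_count and bg in keyword_bigrams
    }

def tokenize_with_phrases(keyword, phrases):
    """Split a keyword into tokens, treating library phrases as one token
    (greedy, left to right)."""
    words = tokens(keyword)
    out, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if i + 1 < len(words) and pair in phrases:
            out.append(pair)
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

# Hypothetical mined content and customer keywords.
docs = [
    "Trail running shoes need good grip. Our trail running guide compares brands.",
    "Gore-Tex liners keep trail running shoes dry. Gore-Tex is also used in jackets.",
]
keywords = ["best trail running shoes", "gore tex jacket", "running shoes sale"]

library = build_phrase_library(docs, keywords)
print(sorted(library))
for kw in keywords:
    print(kw, "->", tokenize_with_phrases(kw, library))
```

Once multi-word phrases and product names like these are treated as single tokens, any downstream clustering works with units that mean something in the customer's domain.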
I’m planning to combine the customized topic library with a tool like the Python Natural Language Toolkit to create keyword clusters for better analysis. I’ll keep you updated on the results.
What tools and techniques have you used for grouping keywords for analysis? Please share in the comments.