Keyword Analysis by Number of Terms (and the RegEx that helps)


Do long search phrases convert better?

This was what I wanted to find out for a particular client, but it took some work. I used a regular expression in the Keywords Report of Google Analytics to filter by the number of terms in the Keyword Phrase. The exported results showed a clear increase in conversion rate as the number of search terms increased.

This client was doing far better with searchers who were using a lot of terms. They were being specific! They knew just what they were looking for and were ready to buy. This data put additional power behind recommendations concerning content, search engine optimization and paid search strategies.

Number of terms    Conversion rate
1                  0.59%
2                  0.60%
3                  0.90%
4                  1.17%
5                  1.06%
6                  1.22%
7                  1.88%
8                  3.33%



Even though there were a lot of people using long search phrases, this data was obscured. As the number of terms increased, the number of people searching for exactly that phrase decreased. As a result, no individual phrase seemed to count for much on its own: the so-called Long Tail.
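The roll-up that brings those scattered phrases back into view can be sketched in a few lines of Python. This is only an illustration: the rows are made-up numbers, not the client’s data, and `count_terms` and `conversion_rate_by_terms` are hypothetical helpers that treat every run of word characters as one term.

```python
import re
from collections import defaultdict

def count_terms(phrase):
    """Count runs of word characters: a rough stand-in for 'number of terms'."""
    return len(re.findall(r"\w+", phrase))

def conversion_rate_by_terms(rows):
    """rows: iterable of (keyword, visits, conversions) tuples.

    Returns {term_count: conversion rate in percent}, pooling every
    keyword that has the same number of terms.
    """
    visits = defaultdict(int)
    conversions = defaultdict(int)
    for keyword, v, c in rows:
        n = count_terms(keyword)
        visits[n] += v
        conversions[n] += c
    return {n: 100.0 * conversions[n] / visits[n] for n in sorted(visits)}

# Hypothetical export rows: (keyword, visits, conversions)
sample_rows = [
    ("windows", 1000, 5),
    ("replacement windows", 400, 3),
    ("vinyl replacement windows", 50, 1),
    ("cheap replacement windows", 50, 0),
]
rates = conversion_rate_by_terms(sample_rows)
```

Neither three-word phrase looks like much by itself, but pooled together they form a bucket with its own conversion rate.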

You really have to dig to find these sorts of gems but they are invaluable in the pursuit of providing information that can be acted upon.

A tool for digging

The tool is a Regular Expression, a pattern matching language. If you’re not already familiar with it, there is a great series of articles right here on the LunaMetrics blog.

Here is what I used:

^([\+*\"*\s*\,*\'*\-*]*\w+\b\s*[\+*\"*\s*\,*\'*\-*]*){3}$

It accounts for the most common characters I’ve found between words.

Steve (see comments) pointed out a great way to shorten my expression by using the \W character set. Here is what it looks like.

^(\W*\w+\b\W*){3}$

\W is shorthand for all non-word characters

How do I use it?


I know this may look like gibberish but keep reading — you don’t need to understand it to get some use from it.

In Google Analytics, go to Traffic Sources > Keywords and paste the Regular Expression into the box at the bottom of the data. Just change the {3} to whatever number of terms you want to see and click the GO button.
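If you have already exported the keyword list, the same filtering can be tried offline. The sketch below assumes the shortened expression `^(\W*\w+\b\W*){3}$` and uses Python’s `re` module, which is not the engine Google Analytics uses, so edge cases may differ. `term_count_pattern` and `keywords_with_n_terms` are hypothetical names, and the keywords are invented.

```python
import re

def term_count_pattern(n):
    """Build the shortened expression for exactly n terms."""
    return re.compile(r"^(\W*\w+\b\W*){%d}$" % n)

def keywords_with_n_terms(keywords, n):
    """Keep only the keywords that the n-term expression matches."""
    pattern = term_count_pattern(n)
    return [k for k in keywords if pattern.search(k)]

# Invented keyword list for illustration
keywords = [
    "windows",
    "replacement windows",
    "vinyl replacement windows",
    "buy vinyl replacement windows",
]
three = keywords_with_n_terms(keywords, 3)
```

Changing the `3` passed to `keywords_with_n_terms` plays the same role as changing the {3} in the filter box.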

A brief look at the RegEx

Although this is not strictly a Regular Expression post, I feel obligated to include a basic glance at the different parts of the expression. Feel free to skip this if you just don’t care.

^ anchors the beginning of the match to the beginning of the string

( ) used to group a set of items together for a match

[\+*\"*\s*\,*\'*\-*]* This group matches any number and any order of + " , - ' and whitespace (\s). It is what handles all the characters that might end up separating different search terms.

\w+ Matches 1 or more alphanumeric characters (\w is another pre-defined set of characters, like \s)

\b Matches a word boundary. It forces each run of \w characters to be separated from the next by something. Otherwise the expression will match any string of three or more characters, because a single word can be split into pieces.

{3} Requires exactly 3 of the above sequence so it would match the phrase one two three but not one two three four

$ anchors the end of the match to the end of the string
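To see the parts working together, here is a small Python sketch that assembles the expression from named pieces. It is a simplified stand-in, not the exact expression: the separator class is written without the redundant escapes and asterisks, `build` is a hypothetical helper, and Python’s `re` engine may not behave identically to the one in Google Analytics.

```python
import re

SEPARATORS = r"""[+",'\s-]*"""  # any run of + " , ' - and whitespace
WORD = r"\w+"                   # one or more word characters
BOUNDARY = r"\b"                # word boundary

def build(n, with_boundary=True):
    """Assemble ^( separators word boundary separators ){n}$ from its parts."""
    unit = SEPARATORS + WORD + (BOUNDARY if with_boundary else "") + SEPARATORS
    return re.compile("^(" + unit + "){" + str(n) + "}$")

three_terms = build(3)
three_terms_no_boundary = build(3, with_boundary=False)

exact = three_terms.search("one two three") is not None
too_long = three_terms.search("one two three four") is not None
single_word = three_terms.search("window") is not None
single_word_no_boundary = three_terms_no_boundary.search("window") is not None
```

Without the boundary, the single word "window" is accepted as three "terms" because the engine is free to split it into pieces; with the boundary in place it is rejected.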


Don’t Sweat the Small Stuff

You can’t account for every situation. For example, sometimes ‘ is meant as an apostrophe and sometimes “-” is used as a hyphen. In the end the impact is usually small – just 2-3% of the search phrases were affected in my case and they just get bumped to the next higher match instead. (For example, non-glare window would match at {3} instead of {2})
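A quick way to check this on your own data is sketched below, again in Python and again under assumptions: the shortened `\W` form of the expression stands in for the filter, the keyword list is invented, and `share_affected` is a hypothetical helper that flags phrases with an apostrophe or hyphen inside a word.

```python
import re

def matches_n_terms(phrase, n):
    """True if the shortened expression counts exactly n terms in phrase."""
    return re.search(r"^(\W*\w+\b\W*){%d}$" % n, phrase) is not None

# An apostrophe or hyphen with a word character on both sides
INTRA_WORD = re.compile(r"\w['-]\w")

def share_affected(keywords):
    """Fraction of keywords containing an intra-word apostrophe or hyphen."""
    affected = [k for k in keywords if INTRA_WORD.search(k)]
    return len(affected) / len(keywords)

# Invented keywords for illustration
sample = ["non-glare window", "window film", "crohn's disease", "glass"]
bumped_to_three = matches_n_terms("non-glare window", 3)
counted_as_two = matches_n_terms("non-glare window", 2)
share = share_affected(sample)
```

Running `share_affected` over a real export gives a rough idea of how many phrases get bumped to a higher count.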

It is an interesting way to look at keyword data, and maybe you’ll get some use from it. If you do, let me know.


John is a former LunaMetrician and contributor to our blog.

  • Tim Leighton-Boyce

    Fascinating. And what an inspiring use of a regex.

    It sets my mind spinning away: I wonder if it would be possible to contrive a filter to replace the actual keywords with just the number of words in the phrase? That way you could have a profile which showed a summary form of the information — which might be useful to allow tracking of any changes over time without all the hard work of exporting and compiling the information.

  • Steve

    Great Post John! I’d not known you could filter like this! Way cool!!!! (too many exclamation marks???)

    BTW, and because Robbin would be disappointed if I didn’t. 😉


    Is a simpler variant. Even that’s wrong tho, as it matches things like:
    “crohn’s disease”
    To pick the obvious example from GA on work’s site.

    This works as the {3} … simply … acts like a multiplier. ie:

    But wait. There’s more! You can simplify even further.

    Where \W == [^\w] in essence.
    Still haven’t solved for “crohn’s disease”, but it’s 6.30am and I’ve not had coffee yet. 😀

    The \b is needed – as it can match the $ and hence why I’ve removed same, but also the “wrap” from the {3}, with the next \W*, you can match zero characters, so you can get \w+\w+\w+, which is NOT what is wanted. Always be careful of the asterisk, it can easily break things by simply not matching anything – the zero case.

    FWIW. Using the negated match as a pre/post-pender is a very obvious way to more or less match… everything. You don’t end up with any subtle cases that drop off.

    Also, you don’t need to escape all characters inside []. Or put multipliers (eg *) against them. Your [] construct:
    is equivalent to:
    ie. You’ve included the ‘*’ as a character to match. Possibly not what was desired?

    Generally it’s Bad Practice to escape characters that don’t need it. People start expecting special meanings where none exist and that’s not a good idea for the person coming after you. If/But/Maybe circumstances? Certainly! 🙂

    Cheers! And thanks again for a great heads up!
    – Steve

    PS \w is *usually* equivalent to [a-zA-Z0-9_] Yup, Includes an underscore. Be Ware of your assumptions! 🙂
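Steve’s three points about character classes are easy to verify. The sketch below uses Python’s `re` module as a stand-in (an assumption: other engines generally agree on these basics, but this was not tested in Google Analytics), and the two small classes are illustrative, not the ones from the post.

```python
import re

# Inside [], + and , need no escaping: both classes accept the same characters.
escaped = re.compile(r"[\+\,]")
plain = re.compile(r"[+,]")
probe = "+,*a "
same_behaviour = (
    [bool(escaped.fullmatch(c)) for c in probe]
    == [bool(plain.fullmatch(c)) for c in probe]
)

# A * inside [] is a literal asterisk, not a multiplier.
star_is_literal = re.fullmatch(r"[+*]", "*") is not None

# \w includes the underscore.
underscore_is_word = re.fullmatch(r"\w+", "long_tail") is not None
```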

  • Steve

    I did not realize that you didn’t have to escape inside []. And great examples of shortening up my unwieldy expressions, I didn’t even think to use \W !! However they don’t quite work. For example, ^(\W*\w+\b){3} will find a match with: “one two three four” so you need the $ anchor at the end. And once you put the $ anchor, you need to have your non-word characters on the right side too, to catch stuff after the last word. So what we probably need is something like ^(\W*\w+\b\W*){3}$ instead of ^(\W*\w+\b){3}.
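The three variants discussed in this comment can be compared side by side. This sketch uses Python’s `re.search`, which, like the Google Analytics filter box as described here, looks for a match anywhere in the string unless the expression is anchored; the variable names are only for illustration.

```python
import re

unanchored = re.compile(r"^(\W*\w+\b){3}")            # no $ at the end
anchored_only = re.compile(r"^(\W*\w+\b){3}$")        # $ added, no trailing \W*
corrected = re.compile(r"^(\W*\w+\b\W*){3}$")         # $ plus trailing \W*

# Without $, a four-word phrase still matches on its first three words.
four_words_unanchored = unanchored.search("one two three four") is not None
four_words_corrected = corrected.search("one two three four") is not None

# With $ but no trailing \W*, a closing quote after the last word breaks the match.
quoted = '"one two three"'
quoted_anchored_only = anchored_only.search(quoted) is not None
quoted_corrected = corrected.search(quoted) is not None
```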

  • Steve

    Argggh! The “$”! Of course it’s still needed. My thanks for correcting me!
    If I was hunting for an excuse I’d blame it on a lack of coffee. Sadly, while a caffeine deficiency event was happening, I simply got it wrong. 🙂

    Hmm. Yes agreed. Good point!
    Tho I suspect you could/should ditch the \b pre the \W* then.
    \W* … *should* pick that up anyway. (After shifting dirt all day long, I’m too tired and sore to swing round in my chair and read my regex book. 🙂 )

    Ahhh! Of course! Better:


    It might even be worthwhile having a really good detailed look at the data coming in – you *may* be able to pull the lead \W* outside the braces after the caret:

    Which is preferable from an efficiency perspective.
    Though this will break (ie ignore) on the use of single quotes eg the example I gave previously.

    Make sense?

    – Steve

  • Anybody have a way for not counting apostrophes as word delimiters? It’s a major issue with the site I tested this on. –dave

  • David,

    I believe that is possible to take into account. I never tried since in my case it was not very important. Try the following RegEx and let me know if it works for you. It accounts for both a – and a ‘ in the word:


    It uses a Negative Lookahead and says “only match a boundary if [-']?\w does NOT match after it”.
    The [-'] means match a - or a '. [-']? says match it if it’s there, but it’s optional.

    So for “John’s mini-bike” it hits a \b after the first n, but checks what is in front. It’s a ' followed by a \w, so it’s not allowed to match the \b. Then it comes to another \b after the ', checks what is in front of it, and finds a \w. Since the [-'] is optional thanks to the [-']?, the lookahead still matches, preventing the \b from matching the word boundary. Then it sees a \b after the s. It looks ahead and finds a space. The Lookahead part does not match that, and the [\w-'] CAN’T match it, so the \b matches it, and we go to the next repetition.

    I plan to cover another really great use of the Negative Lookahead in another post.

    By the way, there are also Lookbehinds … but GA does not support them as far as I can tell.

    If that doesn’t seem to work, or if someone has a better way, I’d love to hear it.

    BEWARE: don’t copy and paste it from these comments, since WordPress will probably convert the apostrophe into a curly single quote. Just type it into Notepad by hand, and use that.
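The expression from this comment did not survive here, so the sketch below is a reconstruction from the description, not the original. It assumes a word is a run of `[\w'-]` and that a boundary only counts when `[-']?\w` does not follow it. It is written for Python’s `re` module (where the class has to be spelled `[\w'-]`, with the hyphen last), and `matches_n_terms_keeping_joiners` is a hypothetical name.

```python
import re

def matches_n_terms_keeping_joiners(phrase, n):
    """Count terms, treating intra-word apostrophes and hyphens as part of a word.

    The negative lookahead (?![-']?\\w) rejects any boundary that is followed
    by an optional - or ' and then another word character.
    """
    pattern = r"^(\W*[\w'-]+\b(?![-']?\w)\W*){%d}$" % n
    return re.search(pattern, phrase) is not None

johns_bike_is_two = matches_n_terms_keeping_joiners("John's mini-bike", 2)
johns_bike_is_three = matches_n_terms_keeping_joiners("John's mini-bike", 3)
crohns_is_two = matches_n_terms_keeping_joiners("crohn's disease", 2)
plain_three = matches_n_terms_keeping_joiners("one two three", 3)
```

Note that the apostrophes in the code are straight quotes, for the reason given in the warning above.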

  • Thanks John,
    Very interesting & useful. I did a quick run through our results for 2007 and discovered that roughly 87% of our traffic and 87% of our sales came in on phrases of 3 words or fewer, with about 60% contributed by 2-word phrases, which is also where our conversion rate peaked.

    If I had guessed at the results before running this, my guesses would have been way off. I would have figured that our peak would have been at about 4 keyword phrases and more heavily weighted towards longer (brand and model specific) terms.


  • Another interesting take on this topic…



  • Hi,

    Thanks for this tip. Now for a Danish customer I need a regular expression that also includes the Danish characters æ, ø and å. And for this particular customer also the Swedish character ö.

    If I use this one: ^([\+*\"*\s*\,*\'*\-*]*\w+\b\s*[\+*\"*\s*\,*\'*\-*]*){3}$

    these characters are not included.

    If I use this one: ^(\W*\w+\b\W*){3}$

    these characters are included, but they seem to count as spaces so that for example the word “hængelås” (Danish word for padlock) is listed as a three word phrase.

    Let me know if you have a solution for this – it will help me a lot!

    Michael Dalmer

  • Google Analytics supports the \p unicode character sets such as

    \p{L} any letter
    \p{N} any number
    \p{P} any punctuation
    \p{Z} any whitespace

    you can negate a character set with ^ inside the {} such as

    \p{^L} any non-letter

    So the equivalent to \w should be [\p{L}\p{N}] or any letter or any number

    So that lets us handle any part of the expression except the \b, the word break. I’m not sure what the equivalent is here.

    Note, I have no experience with the \p stuff myself, and I don’t know how it would actually play out or how close you could come to getting it to work. This is just a starting point
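Python’s standard `re` module does not support `\p{...}`, so the suggestion above cannot be tested with it directly. What it can do is reproduce Michael’s symptom: the sketch assumes (without confirmation) that the engine he was using treated `\w`, `\W` and `\b` as ASCII-only, which `re.ASCII` imitates. `matches` is a hypothetical helper.

```python
import re

PATTERN = r"^(\W*\w+\b\W*){%d}$"

def matches(phrase, n, ascii_only):
    """Apply the n-term expression with ASCII-only or Unicode-aware \\w and \\b."""
    flags = re.ASCII if ascii_only else 0
    return re.search(PATTERN % n, phrase, flags) is not None

word = "hængelås"  # Danish for padlock

# ASCII-only: æ and å are non-word characters, so the word splits into h / ngel / s.
ascii_three = matches(word, 3, ascii_only=True)
ascii_one = matches(word, 1, ascii_only=True)

# Unicode-aware: æ and å are word characters, so it is a single term.
unicode_one = matches(word, 1, ascii_only=False)
unicode_three = matches(word, 3, ascii_only=False)
```

If that assumption holds, any fix has to make both the word class and the boundary aware of non-ASCII letters, which is exactly the part left open above.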

  • My 3rd grade English teacher would argue that in order for a string to contain 2 words, it must contain a space between the two words. In this case, it’s as simple as:
    .+ .+
    one word:
    change filter to exclude and hit the space bar
    or three words:
    .+ .+ .+
    Sure, you’ll get some funky outliers… but you’ll also probably be able to memorize the regex. This would include numbers and other characters as well, but I would also argue that numbers are words in this context. If you don’t think numbers are words, then just use John’s example above to construct:
    \p{L}+ \p{L}+
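One thing to watch with the space-based approach: because `.` also matches a space, an unanchored `.+ .+` means “at least two words”, not “exactly two”. The Python sketch below shows this, with an anchored `\S+` variant and a plain split as two possible alternatives; the variant and the `word_count` helper are illustrations, not something proposed in the thread.

```python
import re

at_least_two = re.compile(r".+ .+")        # the expression as written above
exactly_two = re.compile(r"^\S+ \S+$")     # anchored variant using non-space runs

def word_count(phrase):
    """Count space-separated words, the '3rd grade' definition."""
    return len(phrase.split())

two = "blue widgets"
three = "blue widgets philadelphia"

loose_matches_two = at_least_two.search(two) is not None
loose_matches_three = at_least_two.search(three) is not None
strict_matches_two = exactly_two.search(two) is not None
strict_matches_three = exactly_two.search(three) is not None
```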

    Actually, this whole counting keywords thing seems a little numerology heavy. I think what really contributes to the higher conversion rates are the combination of a brand term with a relevant topic / item. For example, someone landing on this site having searched “luna metrics blue widgets philadelphia” is certainly going to convert higher, as they are measurably both brand aware and purchase intending for those fancy blue widgets and localized. You’re gonna need some really fancy regex to measure that 🙂

