<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Keyword Analysis by Number of Terms (and the RegEx that helps)</title>
	<atom:link href="http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/</link>
	<description>LunaMetrics' blog on conversion rate and web analytics</description>
	<pubDate>Thu, 21 Aug 2008 18:17:01 +0000</pubDate>
	<generator>http://wordpress.org/</generator>
		<item>
		<title>By: John</title>
		<link>http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1243</link>
		<dc:creator>John</dc:creator>
		<pubDate>Sun, 09 Mar 2008 20:51:29 +0000</pubDate>
		<guid isPermaLink="false">http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1243</guid>
		<description>Google Analytics supports the \p Unicode character sets, such as:

\p{L} any letter
\p{N} any number
\p{P} any punctuation
\p{Z} any whitespace

You can negate a character set with ^ inside the {}, such as:

\p{^L} any non-letter


So the equivalent to \w should be [\p{L}\p{N}], that is, any letter or any number.

So that lets us handle every part of the expression except the \b, the word break.  I'm not sure what the equivalent is here.

Note: I have no experience with the \p stuff myself, and I don't know how it would actually play out or how close you could come to getting it to work. This is just a starting point.</description>
		<content:encoded><![CDATA[<p>Google Analytics supports the \p Unicode character sets, such as:</p>
<p>\p{L} any letter<br />
\p{N} any number<br />
\p{P} any punctuation<br />
\p{Z} any whitespace</p>
<p>You can negate a character set with ^ inside the {}, such as:</p>
<p>\p{^L} any non-letter</p>
<p>So the equivalent to \w should be [\p{L}\p{N}], that is, any letter or any number.</p>
<p>So that lets us handle every part of the expression except the \b, the word break.  I&#8217;m not sure what the equivalent is here.</p>
<p>Note: I have no experience with the \p stuff myself, and I don&#8217;t know how it would actually play out or how close you could come to getting it to work. This is just a starting point.</p>
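<p>A minimal sketch of the idea above, written in Python rather than tested in Google Analytics. Python's built-in re module has no \p{L} or \p{N}, so [^\W_] and [\W_] stand in (roughly) for [\p{L}\p{N}] and its negation. That substitution, the has_n_words name and the sample phrases are all assumptions of this sketch. It sidesteps the missing \b by requiring at least one non-letter, non-number character between words.</p>

```python
import re

# Stand-ins (assumption): for str patterns in Python 3, [^\W_] matches any
# Unicode letter or digit, and [\W_] matches every other character.
WORD = r"[^\W_]+"
GAP = r"[\W_]+"


def has_n_words(phrase: str, n: int) -> bool:
    """Return True if phrase holds exactly n runs of letters/digits (n >= 1)."""
    # No \b anywhere: consecutive words must be separated by at least one
    # GAP character, so a single word cannot be split into several.
    pattern = (
        r"^[\W_]*(?:" + WORD + GAP + r"){" + str(n - 1) + r"}"
        + WORD + r"[\W_]*$"
    )
    return re.match(pattern, phrase) is not None


print(has_n_words("hængelås til cykel", 3))  # True
print(has_n_words("hængelås", 3))            # False
```

<p>Whether GA's engine accepts the same construction with the \p classes is not something this sketch verifies.</p>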
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael Dalmer</title>
		<link>http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1242</link>
		<dc:creator>Michael Dalmer</dc:creator>
		<pubDate>Sun, 09 Mar 2008 17:55:31 +0000</pubDate>
		<guid isPermaLink="false">http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1242</guid>
		<description>Hi,

Thanks for this tip. Now for a Danish customer I need a regular expression that also includes the Danish characters æ, ø and å, and for this particular customer also the Swedish character ö.

If I use this one: ^([\+*"*\s*,*'*\-*]*\w+\b\s*[\+*"*\s*,*'*\-*]*){3}$ 

these characters are not included. 

If I use this one: ^(\W*\w+\b\W*){3}$ 

these characters are included, but they seem to count as spaces, so that, for example, the word "hængelås" (Danish for padlock) is listed as a three-word phrase.

Let me know if you have a solution for this - it will help me a lot!

Thanks, 
Michael Dalmer</description>
		<content:encoded><![CDATA[<p>Hi,</p>
<p>Thanks for this tip. Now for a Danish customer I need a regular expression that also includes the Danish characters æ, ø and å, and for this particular customer also the Swedish character ö.</p>
<p>If I use this one: ^([\+*"*\s*,*'*\-*]*\w+\b\s*[\+*"*\s*,*'*\-*]*){3}$ </p>
<p>these characters are not included. </p>
<p>If I use this one: ^(\W*\w+\b\W*){3}$ </p>
<p>these characters are included, but they seem to count as spaces, so that, for example, the word &#8220;hængelås&#8221; (Danish for padlock) is listed as a three-word phrase.</p>
<p>Let me know if you have a solution for this - it will help me a lot!</p>
<p>Thanks,<br />
Michael Dalmer</p>
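<p>One plausible explanation for the three-word count, sketched in Python (an assumption, since GA's engine is not Python's): if \w only covers ASCII, then æ and å fall into \W and cut the word into h, ngel and s. Python's re.ASCII flag reproduces that behaviour, while the default Unicode mode treats the word as a single term.</p>

```python
import re

THREE_WORDS = r"^(\W*\w+\b\W*){3}$"

# re.ASCII limits \w to [a-zA-Z0-9_], so æ and å act as separators.
ascii_match = re.match(THREE_WORDS, "hængelås", flags=re.ASCII)

# Without the flag, Python 3 treats æ, ø, å and ö as word characters.
unicode_match = re.match(THREE_WORDS, "hængelås")

print(ascii_match is not None)    # True: counted as three "words"
print(unicode_match is not None)  # False: one word, so {3} cannot match
```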
]]></content:encoded>
	</item>
	<item>
		<title>By: Rethinking your strategy, and why the long tail converts better.</title>
		<link>http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1234</link>
		<dc:creator>Rethinking your strategy, and why the long tail converts better.</dc:creator>
		<pubDate>Thu, 06 Mar 2008 19:10:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1234</guid>
		<description>[...] it a good idea to go after it? Probably not. There's a great post over at LunaMetrics that proves why longer keyword phrases convert better. For example if someone searches for a &#8220;Dell computer&#8221;, they are probably researching [...]</description>
		<content:encoded><![CDATA[<p>[...] it a good idea to go after it? Probably not. There&#8217;s a great post over at LunaMetrics that proves why longer keyword phrases convert better. For example if someone searches for a &#8220;Dell computer&#8221;, they are probably researching [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Greg Moore</title>
		<link>http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1079</link>
		<dc:creator>Greg Moore</dc:creator>
		<pubDate>Thu, 03 Jan 2008 20:14:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1079</guid>
		<description>Another interesting take on this topic...
http://www.isearchmedia.com/articles/Keyword_Length_and_Conversion_Rate.php

Cheers!</description>
		<content:encoded><![CDATA[<p>Another interesting take on this topic&#8230;<br />
<a href="http://www.isearchmedia.com/articles/Keyword_Length_and_Conversion_Rate.php" rel="nofollow">http://www.isearchmedia.com/articles/Keyword_Length_and_Conversion_Rate.php</a></p>
<p>Cheers!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Loren Hadley</title>
		<link>http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1078</link>
		<dc:creator>Loren Hadley</dc:creator>
		<pubDate>Thu, 03 Jan 2008 19:07:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1078</guid>
		<description>Thanks John,  
Very interesting &#38; useful.  I did a quick run through our results for 2007 and discovered that roughly 87% of our traffic and 87% of our sales came in on phrases of 3 words or fewer, with about 60% contributed by 2-word phrases, which is also where our conversion rate peaked. 

If I had guessed at the results before running this, my guesses would have been way off.  I would have figured that our peak would have been at about 4-word phrases and more heavily weighted towards longer (brand- and model-specific) terms. 

Thanks, 
Loren</description>
		<content:encoded><![CDATA[<p>Thanks John,<br />
Very interesting &amp; useful.  I did a quick run through our results for 2007 and discovered that roughly 87% of our traffic and 87% of our sales came in on phrases of 3 words or fewer, with about 60% contributed by 2-word phrases, which is also where our conversion rate peaked. </p>
<p>If I had guessed at the results before running this, my guesses would have been way off.  I would have figured that our peak would have been at about 4-word phrases and more heavily weighted towards longer (brand- and model-specific) terms. </p>
<p>Thanks,<br />
Loren</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1077</link>
		<dc:creator>John</dc:creator>
		<pubDate>Thu, 03 Jan 2008 17:58:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1077</guid>
		<description>David,

I believe that is possible to take into account. I never tried since in my case it was not very important. Try the following RegEx and let me know if it works for you. It accounts for both a - and a ' in the word:

^(\W*[\w-']+((?!['-]?\w)\b)\W*){4}$

It uses a Negative Lookahead and says "only match a boundary if [-']?\w does NOT match after it".
The [-'] means match a - or a '; the [-']? says match it if it's there, but it's optional.

So for "John's mini-bike" it hits a \b after the first n, but checks what comes next. It's a ' followed by a \w, so it's not allowed to match the \b. Then it comes to another \b after the ', checks what comes next, and finds a \w. Since the [-'] is optional thanks to the [-']?, the lookahead still matches, preventing the \b from matching the word boundary. Then it sees a \b after the s. It looks ahead and finds a space. The lookahead part does not match that, and the [\w-'] CAN'T match it, so the \b matches, and we go on to the next repetition.

I plan to cover another really great use of the Negative Lookahead in another post.

By the way, there are also Lookbehinds ... but GA does not support them as far as I can tell.

If that doesn't seem to work, or if someone has a better way, I'd love to hear it. 

BEWARE: don't copy and paste it from these comments, since WordPress will probably convert the apostrophe into a curly single quote. Just type it into Notepad by hand, and use that.</description>
		<content:encoded><![CDATA[<p>David,</p>
<p>I believe that is possible to take into account. I never tried since in my case it was not very important. Try the following RegEx and let me know if it works for you. It accounts for both a - and a ' in the word:</p>
<p>^(\W*[\w-']+((?!['-]?\w)\b)\W*){4}$</p>
<p>It uses a Negative Lookahead and says &#8220;only match a boundary if [-']?\w does NOT match after it&#8221;.<br />
The [-'] means match a - or a '; the [-']? says match it if it&#8217;s there, but it&#8217;s optional.</p>
<p>So for &#8220;John&#8217;s mini-bike&#8221; it hits a \b after the first n, but checks what comes next. It&#8217;s a ' followed by a \w, so it&#8217;s not allowed to match the \b. Then it comes to another \b after the ', checks what comes next, and finds a \w. Since the [-'] is optional thanks to the [-']?, the lookahead still matches, preventing the \b from matching the word boundary. Then it sees a \b after the s. It looks ahead and finds a space. The lookahead part does not match that, and the [\w-'] CAN&#8217;T match it, so the \b matches, and we go on to the next repetition.</p>
<p>I plan to cover another really great use of the Negative Lookahead in another post.</p>
<p>By the way, there are also Lookbehinds &#8230; but GA does not support them as far as I can tell.</p>
<p>If that doesn&#8217;t seem to work, or if someone has a better way, I&#8217;d love to hear it. </p>
<p>BEWARE: don&#8217;t copy and paste it from these comments, since WordPress will probably convert the apostrophe into a curly single quote. Just type it into Notepad by hand, and use that.</p>
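<p>For anyone who wants to try the lookahead outside GA, here is a small Python sketch comparing it with the plain \w+ pattern. The count_words helper and the sample phrases are made up for illustration, and the hyphen is moved to the end of the class ([\w'-]) because Python's re module rejects [\w-'] as a bad range. Behaviour in GA itself may differ.</p>

```python
import re

# %d is filled in with the number of words to test for.
SIMPLE = r"^(\W*\w+\b\W*){%d}$"
# Same idea as the expression above, with the hyphen last so that Python
# reads it as a literal character rather than as a range.
LOOKAHEAD = r"^(\W*[\w'-]+((?!['-]?\w)\b)\W*){%d}$"


def count_words(template: str, phrase: str, limit: int = 10):
    """Return the first n in 1..limit whose n-word pattern matches, else None."""
    for n in range(1, limit + 1):
        if re.match(template % n, phrase):
            return n
    return None


print(count_words(SIMPLE, "John's mini-bike"))     # 4
print(count_words(LOOKAHEAD, "John's mini-bike"))  # 2
```

<p>The plain pattern splits on the apostrophe and the hyphen, while the lookahead version keeps each of those inside a single word.</p>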
]]></content:encoded>
	</item>
	<item>
		<title>By: David Lenef</title>
		<link>http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1075</link>
		<dc:creator>David Lenef</dc:creator>
		<pubDate>Thu, 03 Jan 2008 15:59:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1075</guid>
		<description>Anybody have a way for not counting apostrophes as word delimiters? It's a major issue with the site I tested this on. --dave</description>
		<content:encoded><![CDATA[<p>Anybody have a way for not counting apostrophes as word delimiters? It&#8217;s a major issue with the site I tested this on. &#8211;dave</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Steve</title>
		<link>http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1073</link>
		<dc:creator>Steve</dc:creator>
		<pubDate>Thu, 03 Jan 2008 08:56:16 +0000</pubDate>
		<guid isPermaLink="false">http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1073</guid>
		<description>Argggh! The "$"! Of course it's still needed. My thanks for correcting me!
If I was hunting for an excuse I'd blame it on a lack of coffee. Sadly, while a caffeine deficiency event was happening, I simply got it wrong. :-)

^(\W*\w+\b\W*){3}$
Hmm. Yes agreed. Good point!
Tho I suspect you could/should ditch the \b pre the \W* then.
\W* ... *should* pick that up anyway. (After shifting dirt all day long, I'm too tired and sore to swing round in my chair and read my regex book. :-) )

Ahhh! Of course! Better:
^(\W*\w+\b){3}\W*$

???

It might even be worthwhile having a really good detailed look at the data coming in - you *may* be able to pull the lead \W* outside the braces after the caret:
^\W*(\w+\b){3}\W*$

Which is preferable from an efficiency perspective.
Though this will break (ie ignore) on the use of single quotes eg the example I gave previously.

Make sense?

Cheers!
- Steve</description>
		<content:encoded><![CDATA[<p>Argggh! The &#8220;$&#8221;! Of course it&#8217;s still needed. My thanks for correcting me!<br />
If I was hunting for an excuse I&#8217;d blame it on a lack of coffee. Sadly, while a caffeine deficiency event was happening, I simply got it wrong. <img src='http://www.lunametrics.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>^(\W*\w+\b\W*){3}$<br />
Hmm. Yes agreed. Good point!<br />
Tho I suspect you could/should ditch the \b pre the \W* then.<br />
\W* &#8230; *should* pick that up anyway. (After shifting dirt all day long, I&#8217;m too tired and sore to swing round in my chair and read my regex book. <img src='http://www.lunametrics.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> )</p>
<p>Ahhh! Of course! Better:<br />
^(\W*\w+\b){3}\W*$</p>
<p>???</p>
<p>It might even be worthwhile having a really good detailed look at the data coming in - you *may* be able to pull the lead \W* outside the braces after the caret:<br />
^\W*(\w+\b){3}\W*$</p>
<p>Which is preferable from an efficiency perspective.<br />
Though this will break (ie ignore) on the use of single quotes eg the example I gave previously.</p>
<p>Make sense?</p>
<p>Cheers!<br />
- Steve</p>
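<p>A quick way to sanity-check the three variants above is to run them over a few phrases. This is a Python sketch, so it assumes Python's re behaves like GA's engine for these constructs; the labels A, B, C and the matching_variants helper are made up for illustration. In this check the last variant matches none of the multi-word phrases, since \w+\b cannot be followed directly by another \w+ without something in between.</p>

```python
import re

VARIANTS = {
    "A": r"^(\W*\w+\b\W*){3}$",  # non-word characters inside the group
    "B": r"^(\W*\w+\b){3}\W*$",  # trailing \W* pulled outside
    "C": r"^\W*(\w+\b){3}\W*$",  # leading \W* pulled outside as well
}


def matching_variants(phrase: str) -> list:
    """Return the labels of the variants that match the whole phrase."""
    return [name for name, pattern in VARIANTS.items()
            if re.match(pattern, phrase)]


print(matching_variants("one two three"))       # ['A', 'B']
print(matching_variants("one, two three!"))     # ['A', 'B']
print(matching_variants("one two three four"))  # []
```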
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1070</link>
		<dc:creator>John</dc:creator>
		<pubDate>Wed, 02 Jan 2008 20:30:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1070</guid>
		<description>Steve

I did not realize that you didn't have to escape inside []. And great examples of shortening up my unwieldy expressions, I didn't even think to use \W !! However they don't quite work. For example, ^(\W*\w+\b){3} will find a match with: "one two three four" so you need the $ anchor at the end. And once you put the $ anchor, you need to have your non-word characters on the right side too, to catch stuff after the last word. So what we probably need is something like ^(\W*\w+\b\W*){3}$ instead of ^(\W*\w+\b){3}.</description>
		<content:encoded><![CDATA[<p>Steve</p>
<p>I did not realize that you didn&#8217;t have to escape inside []. And great examples of shortening up my unwieldy expressions, I didn&#8217;t even think to use \W !! However they don&#8217;t quite work. For example, ^(\W*\w+\b){3} will find a match with: &#8220;one two three four&#8221; so you need the $ anchor at the end. And once you put the $ anchor, you need to have your non-word characters on the right side too, to catch stuff after the last word. So what we probably need is something like ^(\W*\w+\b\W*){3}$ instead of ^(\W*\w+\b){3}.</p>
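<p>The anchor point can be checked with a short Python sketch (assuming Python's re is close enough to GA's engine for this purpose; the variable names are only for illustration):</p>

```python
import re

UNANCHORED = r"^(\W*\w+\b){3}"
ANCHORED_NO_TAIL = r"^(\W*\w+\b){3}$"
ANCHORED = r"^(\W*\w+\b\W*){3}$"

# Without $, the first three words of a longer phrase are enough to match.
print(bool(re.match(UNANCHORED, "one two three four")))    # True
# With $, the four-word phrase is rejected.
print(bool(re.match(ANCHORED, "one two three four")))      # False
# With $ but no trailing \W*, anything after the last word breaks the match.
print(bool(re.match(ANCHORED_NO_TAIL, "one two three!")))  # False
print(bool(re.match(ANCHORED, "one two three!")))          # True
```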
]]></content:encoded>
	</item>
	<item>
		<title>By: Steve</title>
		<link>http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1069</link>
		<dc:creator>Steve</dc:creator>
		<pubDate>Wed, 02 Jan 2008 19:57:59 +0000</pubDate>
		<guid isPermaLink="false">http://www.lunametrics.com/blog/2008/01/02/keyword-analysis-by-number-of-terms-and-the-regex-that-helps/#comment-1069</guid>
		<description>Great Post John! I'd not known you could filter like this! Way cool!!!! (too many exclamation marks???)

BTW, and because Robbin would be disappointed if I didn't. ;-)

^([\b-"+',]*\w+\b){3}

Is a simpler variant. Even that's wrong tho, as it matches things like:
"crohn's disease"
To pick the obvious example from GA on work's site.

This works as the {3} ... simply ... acts like a multiplier. ie:
^([\b-"+',]*\w+\b)([\b-"+',]*\w+\b)([\b-"+',]*\w+\b)

But wait. There's more! You can simplify even further.
^(\W*\w+\b){3}

Where \W == [^\w] in essence.
Still haven't solved for "crohn's disease", but it's 6.30am and I've not had coffee yet. :-D

The \b is needed for two reasons. It can match at the end of the string, which is why I've removed the $. And because the {3} "wraps" back to the next \W*, which can match zero characters, without the \b you could get \w+\w+\w+, which is NOT what is wanted. Always be careful of the asterisk: it can easily break things by simply not matching anything - the zero case.

FWIW. Using the negated match as a pre/post-pender is a very obvious way to more or less match... everything. You don't end up with any subtle cases that drop off.


Also, you don't need to escape all characters inside []. Or put multipliers (eg *) against them. Your [] construct:
[\+*"*\s*,*'*\-*]
is equivalent to:
[+\s"',*-]
ie. You've included the '*' as a character to match. Possibly not what was desired?

Generally it's Bad Practice to escape characters that don't need it. People start expecting special meanings where none exist and that's not a good idea for the person coming after you. If/But/Maybe circumstances? Certainly! :-)


Cheers! And thanks again for a great heads up!
- Steve

PS \w is *usually* equivalent to [a-zA-Z0-9_] Yup, Includes an underscore. Be Ware of your assumptions! :-)</description>
		<content:encoded><![CDATA[<p>Great Post John! I&#8217;d not known you could filter like this! Way cool!!!! (too many exclamation marks???)</p>
<p>BTW, and because Robbin would be disappointed if I didn&#8217;t. <img src='http://www.lunametrics.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>^([\b-"+',]*\w+\b){3}</p>
<p>Is a simpler variant. Even that&#8217;s wrong tho, as it matches things like:<br />
&#8220;crohn&#8217;s disease&#8221;<br />
To pick the obvious example from GA on work&#8217;s site.</p>
<p>This works as the {3} &#8230; simply &#8230; acts like a multiplier. ie:<br />
^([\b-"+',]*\w+\b)([\b-"+',]*\w+\b)([\b-"+',]*\w+\b)</p>
<p>But wait. There&#8217;s more! You can simplify even further.<br />
^(\W*\w+\b){3}</p>
<p>Where \W == [^\w] in essence.<br />
Still haven&#8217;t solved for &#8220;crohn&#8217;s disease&#8221;, but it&#8217;s 6.30am and I&#8217;ve not had coffee yet. <img src='http://www.lunametrics.com/blog/wp-includes/images/smilies/icon_biggrin.gif' alt=':-D' class='wp-smiley' /> </p>
<p>The \b is needed for two reasons. It can match at the end of the string, which is why I&#8217;ve removed the $. And because the {3} &#8220;wraps&#8221; back to the next \W*, which can match zero characters, without the \b you could get \w+\w+\w+, which is NOT what is wanted. Always be careful of the asterisk: it can easily break things by simply not matching anything - the zero case.</p>
<p>FWIW. Using the negated match as a pre/post-pender is a very obvious way to more or less match&#8230; everything. You don&#8217;t end up with any subtle cases that drop off.</p>
<p>Also, you don&#8217;t need to escape all characters inside []. Or put multipliers (eg *) against them. Your [] construct:<br />
[\+*"*\s*,*'*\-*]<br />
is equivalent to:<br />
[+\s"',*-]<br />
ie. You&#8217;ve included the &#8216;*&#8217; as a character to match. Possibly not what was desired?</p>
<p>Generally it&#8217;s Bad Practice to escape characters that don&#8217;t need it. People start expecting special meanings where none exist and that&#8217;s not a good idea for the person coming after you. If/But/Maybe circumstances? Certainly! <img src='http://www.lunametrics.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>Cheers! And thanks again for a great heads up!<br />
- Steve</p>
<p>PS \w is *usually* equivalent to [a-zA-Z0-9_] Yup, Includes an underscore. Be Ware of your assumptions! <img src='http://www.lunametrics.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /></p>
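<p>The points about character classes and \w can be checked with a few lines of Python (a sketch; GA's engine may differ in details, and the names RANGE and LITERAL are only for illustration). One extra wrinkle worth knowing: an unescaped hyphen between two characters inside [] forms a range, so it is safest to put a literal hyphen last.</p>

```python
import re

# Inside [], + and * are ordinary characters and need no backslash.
print(bool(re.fullmatch(r"[+*]", "*")))  # True

# The \' below is Python string escaping only; the regex itself is ["-'].
# That hyphen forms a range from " to ', which also covers the # character.
RANGE = re.compile('["-\']')
# With the hyphen placed last it is a literal hyphen: the regex is ["'-].
LITERAL = re.compile('["\'-]')

print(bool(RANGE.fullmatch("#")))    # True: # lies inside the range
print(bool(RANGE.fullmatch("-")))    # False: the hyphen itself is not matched
print(bool(LITERAL.fullmatch("-")))  # True

# \w includes the underscore.
print(bool(re.fullmatch(r"\w", "_")))  # True
```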
]]></content:encoded>
	</item>
</channel>
</rss>
