<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Regular Expressions Part XII: Bad Greed</title>
	<atom:link href="http://www.lunametrics.com/blog/2006/12/02/regular-expressons-part-xii-bad-greed/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lunametrics.com/blog/2006/12/02/regular-expressons-part-xii-bad-greed/</link>
	<description>LunaMetrics' blog on conversion rate and web analytics</description>
	<pubDate>Thu, 21 Aug 2008 18:07:54 +0000</pubDate>
	<generator>http://wordpress.org/</generator>
		<item>
		<title>By: Nikki</title>
		<link>http://www.lunametrics.com/blog/2006/12/02/regular-expressons-part-xii-bad-greed/#comment-1228</link>
		<dc:creator>Nikki</dc:creator>
		<pubDate>Mon, 03 Mar 2008 22:13:49 +0000</pubDate>
		<guid isPermaLink="false">http://72.22.16.69/blog/?p=262#comment-1228</guid>
		<description>Thank you!!! I've been struggling with Google Analytics, trying to come up with an expression that matched one term at the beginning of the URL and another term at the end. Problem solved :-) I just wish I'd found your site and its clear examples months ago...</description>
		<content:encoded><![CDATA[<p>Thank you!!! I&#8217;ve been struggling with Google Analytics, trying to come up with an expression that matched one term at the beginning of the URL and another term at the end. Problem solved <img src='http://www.lunametrics.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> I just wish I&#8217;d found your site and its clear examples months ago&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: LunaMetrics Blog</title>
		<link>http://www.lunametrics.com/blog/2006/12/02/regular-expressons-part-xii-bad-greed/#comment-288</link>
		<dc:creator>LunaMetrics Blog</dc:creator>
		<pubDate>Wed, 06 Dec 2006 03:45:00 +0000</pubDate>
		<guid isPermaLink="false">http://72.22.16.69/blog/?p=262#comment-288</guid>
		<description>This is a great way of doing it. BTW, for a custom profile filter in GA, you get the option of telling the filter whether case matters.&lt;BR/&gt;&lt;BR/&gt;It is freezing here in Chicago; I would take Australia in a NY minute.&lt;BR/&gt;&lt;BR/&gt;Robbin</description>
		<content:encoded><![CDATA[<p>This is a great way of doing it. BTW, for a custom profile filter in GA, you get the option of telling the filter whether case matters.</p>
<p>It is freezing here in Chicago; I would take Australia in a NY minute.</p>
<p>Robbin</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anonymous</title>
		<link>http://www.lunametrics.com/blog/2006/12/02/regular-expressons-part-xii-bad-greed/#comment-289</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Wed, 06 Dec 2006 01:36:00 +0000</pubDate>
		<guid isPermaLink="false">http://72.22.16.69/blog/?p=262#comment-289</guid>
		<description>Oh don't get me wrong, your RegEx will get what you're after, it's the fact that it *can* get more that makes it too greedy.&lt;BR/&gt;&lt;BR/&gt;As we've discussed privately - context is *everything*. :-)&lt;BR/&gt;&lt;BR/&gt;If your website only ever has a single-level directory tree, there is no problem.&lt;BR/&gt;If you don't care about intervening directories - ditto.&lt;BR/&gt;&lt;BR/&gt;When I'm trying out a more complex RegEx, I'll usually create a small set (10-20) of test cases to see what matches and what doesn't. I try to cover all the possibilities of what should and shouldn't match, then see if I got it right. I usually don't. :-)&lt;BR/&gt;&lt;BR/&gt;So for your original case I'd try:&lt;BR/&gt;&lt;BR/&gt;/mypage/index.htm     (Y)&lt;BR/&gt;/mypage/index.html    (N)&lt;BR/&gt;/MyPage/Index.HTM     (Y?)&lt;BR/&gt;/otherpage/index.htm  (N)&lt;BR/&gt;/mypage/other.htm     (Y)&lt;BR/&gt;mypage/other.htm      (N)&lt;BR/&gt;other.htm             (N)&lt;BR/&gt;other.gif             (N)&lt;BR/&gt;/mypage/index.jpg     (N)&lt;BR/&gt;/mypage/lower/index.htm  (N?)&lt;BR/&gt;/mypage               (N)&lt;BR/&gt;/mypage/              (N)&lt;BR/&gt;/mypage/htm           (N)&lt;BR/&gt;&lt;BR/&gt;/mypage/index.htm?id=1234  (N, but should it?)&lt;BR/&gt;&lt;BR/&gt;&lt;BR/&gt;where the bracketed Y or N is not included in the test data, but shows what I'd expect to see match or not.&lt;BR/&gt;Some of these are self evident "Won't Work", but it's always good to verify. I speak from shoddy RegEx self-construction experience. :-)&lt;BR/&gt;&lt;BR/&gt;&lt;BR/&gt;What you don't match is just as important as what you do.&lt;BR/&gt;&lt;BR/&gt;- Steve melting in a hot Aussie summer</description>
		<content:encoded><![CDATA[<p>Oh don&#8217;t get me wrong, your RegEx will get what you&#8217;re after, it&#8217;s the fact that it *can* get more that makes it too greedy.</p>
<p>As we&#8217;ve discussed privately - context is *everything*. <img src='http://www.lunametrics.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>If your website only ever has a single-level directory tree, there is no problem.<br />If you don&#8217;t care about intervening directories - ditto.</p>
<p>When I&#8217;m trying out a more complex RegEx, I&#8217;ll usually create a small set (10-20) of test cases to see what matches and what doesn&#8217;t. I try to cover all the possibilities of what should and shouldn&#8217;t match, then see if I got it right. I usually don&#8217;t. <img src='http://www.lunametrics.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>So for your original case I&#8217;d try:</p>
<p>/mypage/index.htm     (Y)<br />/mypage/index.html    (N)<br />/MyPage/Index.HTM     (Y?)<br />/otherpage/index.htm  (N)<br />/mypage/other.htm     (Y)<br />mypage/other.htm      (N)<br />other.htm             (N)<br />other.gif             (N)<br />/mypage/index.jpg     (N)<br />/mypage/lower/index.htm  (N?)<br />/mypage               (N)<br />/mypage/              (N)<br />/mypage/htm           (N)</p>
<p>/mypage/index.htm?id=1234  (N, but should it?)</p>
<p>where the bracketed Y or N is not included in the test data, but shows what I&#8217;d expect to see match or not.<br />Some of these are self evident &#8220;Won&#8217;t Work&#8221;, but it&#8217;s always good to verify. I speak from shoddy RegEx self-construction experience. <img src='http://www.lunametrics.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
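<p>A test list like the one above can be run mechanically. The following is only a rough sketch, using Python's re module as a stand-in (an assumption: the regex engine behind Google Analytics may differ in details). The names PATTERN, ORIGINAL, CASES and run_cases are made up for illustration, and the "Y?" and "N?" expectations are resolved here as a case-sensitive, single-directory match, which is also an assumption.</p>

```python
import re

# Tightened pattern suggested elsewhere in this thread.
PATTERN = re.compile(r"^/mypage/[^/]+\.htm$")
# The original pattern, kept for comparison.
ORIGINAL = re.compile(r"^/mypage/.*\.htm$")

# (path, expected) pairs from the list above. The "Y?" and "N?" entries
# are both taken as "no match" here (case-sensitive, one directory only).
CASES = [
    ("/mypage/index.htm", True),
    ("/mypage/index.html", False),
    ("/MyPage/Index.HTM", False),
    ("/otherpage/index.htm", False),
    ("/mypage/other.htm", True),
    ("mypage/other.htm", False),
    ("other.htm", False),
    ("other.gif", False),
    ("/mypage/index.jpg", False),
    ("/mypage/lower/index.htm", False),
    ("/mypage", False),
    ("/mypage/", False),
    ("/mypage/htm", False),
    ("/mypage/index.htm?id=1234", False),
]


def run_cases(pattern, cases):
    """Return (path, expected, actual) for every case the pattern gets wrong."""
    failures = []
    for path, expected in cases:
        actual = pattern.search(path) is not None
        if actual != expected:
            failures.append((path, expected, actual))
    return failures


for name, pattern in (("tightened", PATTERN), ("original", ORIGINAL)):
    failures = run_cases(pattern, CASES)
    print(name, "-", "all cases pass" if not failures else failures)
```

<p>Under these assumptions the tightened pattern agrees with every expectation, while the original only disagrees on /mypage/lower/index.htm. If case should not matter, compiling the same pattern with re.IGNORECASE would flip the /MyPage/Index.HTM case to a match.</p>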
<p>What you don&#8217;t match is just as important as what you do.</p>
<p>- Steve melting in a hot Aussie summer</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: LunaMetrics Blog</title>
		<link>http://www.lunametrics.com/blog/2006/12/02/regular-expressons-part-xii-bad-greed/#comment-290</link>
		<dc:creator>LunaMetrics Blog</dc:creator>
		<pubDate>Tue, 05 Dec 2006 04:18:00 +0000</pubDate>
		<guid isPermaLink="false">http://72.22.16.69/blog/?p=262#comment-290</guid>
		<description>Hmm... well, I got what I wanted, all the pages that start with /mypage and end with .htm.&lt;BR/&gt;&lt;BR/&gt;Having said that, you allude to a great point -- the ability of the RegEx to do things you don't expect. I think that RegExperts like you are rare and should be worshipped. I am certainly one of your RegEx groupies. And always your RegEx student.&lt;BR/&gt;&lt;BR/&gt;Robbin</description>
		<content:encoded><![CDATA[<p>Hmm&#8230; well, I got what I wanted, all the pages that start with /mypage and end with .htm.</p>
<p>Having said that, you allude to a great point &#8212; the ability of the RegEx to do things you don&#8217;t expect. I think that RegExperts like you are rare and should be worshipped. I am certainly one of your RegEx groupies. And always your RegEx student.</p>
<p>Robbin</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anonymous</title>
		<link>http://www.lunametrics.com/blog/2006/12/02/regular-expressons-part-xii-bad-greed/#comment-291</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Sun, 03 Dec 2006 21:21:00 +0000</pubDate>
		<guid isPermaLink="false">http://72.22.16.69/blog/?p=262#comment-291</guid>
		<description>&lt;B&gt;^/mypage/.*\.htm$&lt;/B&gt;&lt;BR/&gt;&lt;BR/&gt;Won't do just what you're after. This is where the .* construct can be too greedy, since the dot matches slashes as well. :-)&lt;BR/&gt;&lt;BR/&gt;It will also match:&lt;BR/&gt;/mypage/otherfolder/index.htm&lt;BR/&gt;&lt;BR/&gt;Which I suspect from your description is not what you're after?&lt;BR/&gt;&lt;BR/&gt;If you want only .htm files in the mypage directory, you need to construct the RegEx so that it stops including any other directories.&lt;BR/&gt;How are directories identified? By slashes. So how do we not get a directory? By not allowing any more slashes.&lt;BR/&gt;&lt;BR/&gt;So the tail of the expression becomes:&lt;BR/&gt;/[^/]+\.htm$&lt;BR/&gt;&lt;BR/&gt;/ - the directory separator (in our case prefaced by ^/mypage)&lt;BR/&gt;&lt;BR/&gt;[^/] - Any *single* character, but NOT a slash.&lt;BR/&gt;&lt;BR/&gt;+ - One or more of the previous. Unless you really expect to have a file called ".htm", always expect at least one character, e.g. "a.htm".&lt;BR/&gt;&lt;BR/&gt;Rest is as you've described.&lt;BR/&gt;&lt;BR/&gt;So the full RegEx becomes:&lt;BR/&gt;&lt;BR/&gt;^/mypage/[^/]+\.htm$&lt;BR/&gt;&lt;BR/&gt;&lt;BR/&gt;The trick with this style of RegEx construction is to try and figure out what *else* you could match, inadvertently. This is one of the reasons why I try really hard not to use ".*", preferring ".+", as the zero case of ".*" can be unexpected.&lt;BR/&gt;&lt;BR/&gt;HTH?&lt;BR/&gt;&lt;BR/&gt;- Steve, GoobleDeGook Provider Extraordinaire</description>
		<content:encoded><![CDATA[<p><b>^/mypage/.*\.htm$</b></p>
<p>Won&#8217;t do just what you&#8217;re after. This is where the .* construct can be too greedy, since the dot matches slashes as well. <img src='http://www.lunametrics.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>It will also match:<br />/mypage/otherfolder/index.htm</p>
<p>Which I suspect from your description is not what you&#8217;re after?</p>
<p>If you want only .htm files in the mypage directory, you need to construct the RegEx so that it stops including any other directories.<br />How are directories identified? By slashes. So how do we not get a directory? By not allowing any more slashes.</p>
<p>So the tail of the expression becomes:<br />/[^/]+\.htm$</p>
<p>/ - the directory separator (in our case prefaced by ^/mypage)</p>
<p>[^/] - Any *single* character, but NOT a slash.</p>
<p>+ - One or more of the previous. Unless you really expect to have a file called &#8220;.htm&#8221;, always expect at least one character, e.g. &#8220;a.htm&#8221;.</p>
<p>Rest is as you&#8217;ve described.</p>
<p>So the full RegEx becomes:</p>
<p>^/mypage/[^/]+\.htm$</p>
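<p>A quick way to see the difference between the two expressions is to run both against a few paths. This is only a sketch, using Python's re module as a stand-in (an assumption: the regex engine behind Google Analytics may differ in details). The names GREEDY, STRICT and paths, and the sample path /mypage/.htm, are made up for illustration.</p>

```python
import re

# The original pattern: ".*" crosses directory boundaries.
GREEDY = re.compile(r"^/mypage/.*\.htm$")
# The tightened pattern: "[^/]+" cannot cross a slash and needs
# at least one character before ".htm".
STRICT = re.compile(r"^/mypage/[^/]+\.htm$")

# Sample paths for illustration only.
paths = [
    "/mypage/index.htm",
    "/mypage/otherfolder/index.htm",
    "/mypage/.htm",
]

# Map each path to (matched by GREEDY, matched by STRICT).
results = {
    p: (GREEDY.search(p) is not None, STRICT.search(p) is not None)
    for p in paths
}

for path, (greedy_hit, strict_hit) in results.items():
    print(f"{path:32} greedy={greedy_hit!s:5} strict={strict_hit}")
```

<p>Both patterns accept /mypage/index.htm. Only the original accepts the nested path and the bare ".htm" file, the latter being the zero case of ".*".</p>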
<p>The trick with this style of RegEx construction is to try and figure out what *else* you could match, inadvertently. This is one of the reasons why I try really hard not to use &#8220;.*&#8221;, preferring &#8220;.+&#8221;, as the zero case of &#8220;.*&#8221; can be unexpected.</p>
<p>HTH?</p>
<p>- Steve, GoobleDeGook Provider Extraordinaire</p>
]]></content:encoded>
	</item>
</channel>
</rss>
