
Regular Expressions Part XII: Bad Greed





Now that I have learned and explained the Regular Expressions that Google Analytics uses:

Backslashes \
Dots .
Carets ^
Dollar signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets [] and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

I want to explore another area: Regular Expressions and the concept of greediness.

You might be tempted to write a Regular Expression like this:

/mypage/

expecting it to match the page on your site called /mypage/

And this Regular Expression really does match /mypage/. But it also matches /mypage/thirdpage-and-something-else. For that matter, it matches /secondpage/mypage/ and /secondpage/mypage/index.html, because nothing in the expression says where the match has to start or stop.
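
If you want to see this for yourself, here is a minimal sketch in Python. It assumes that Python's re.search is a fair stand-in for the way a Google Analytics filter looks for a pattern anywhere in the string; the last two URLs are made up for illustration.

```python
import re

# The unanchored pattern from the post.
pattern = re.compile(r"/mypage/")

# Example URLs; the last two are hypothetical.
urls = [
    "/mypage/",
    "/mypage/thirdpage-and-something-else",
    "/secondpage/mypage/index.html",
    "/otherpage/",
]

# re.search looks for the pattern anywhere in the string.
matched = [url for url in urls if pattern.search(url)]
print(matched)
```

Only /otherpage/ is left out, since it is the only URL that does not contain /mypage/ somewhere.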

Regular Expressions are greedy: they match and match as much as they can. Greed can be good, but first I want to write about the obvious problem, i.e. the RegEx (Regular Expression) will match too many strings to be useful.

We can deal with this in various ways:

1) Tell the RegEx where to start. In the above example, if we write the RegEx like this

^/mypage/

it will only match when /mypage/ is at the beginning of the line, so it will never match /secondpage/mypage/ etc.
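
A quick way to check the effect of the caret, sketched in Python (the helper name starts_with_mypage is made up for this example, and Python's re module is only assumed to behave like the Google Analytics filter here):

```python
import re

# The caret pins the match to the start of the string.
anchored = re.compile(r"^/mypage/")

def starts_with_mypage(url: str) -> bool:
    """Return True only when the URL begins with /mypage/."""
    return anchored.search(url) is not None

print(starts_with_mypage("/mypage/"))                # True
print(starts_with_mypage("/mypage/otherpages.htm"))  # True
print(starts_with_mypage("/secondpage/mypage/"))     # False
```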

2) Tell the RegEx when to stop. We can do this in various ways, but need to know how the expression ends. For example, are we only looking for mypage.htm or are we also looking for all the pages that are in the /mypage/ folder — /mypage/otherpages.htm? If only mypage.htm matters, then we can include that in the RegEx:

/mypage\.htm$

Notice that I used a backslash to make the dot into a real dot and not a special character, and a dollar sign to say, this only works at the end of the line. (That way, mypage.html won’t match.)

We can combine that with #1 above and create this RegEx:

^/mypage\.htm$

and never get any unexpected characters before the slash or after the htm.
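
Here is a small Python sketch of that combined expression. The test URLs are invented for illustration, and /mypageXhtm is there only to show what the backslash buys you.

```python
import re

# Anchored at both ends, with an escaped (literal) dot.
exact = re.compile(r"^/mypage\.htm$")

# Hypothetical URLs to try against the pattern.
urls = ["/mypage.htm", "/mypage.html", "/secondpage/mypage.htm", "/mypageXhtm"]

results = {url: exact.search(url) is not None for url in urls}
for url, ok in results.items():
    print(url, ok)
```

Only /mypage.htm matches. Without the backslash, the dot would stand for any character and /mypageXhtm would match too.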

On the other hand, if all the pages in the /mypage/ folder that have .htm suffixes are of interest, we could do this differently:

^/mypage/.*\.htm$

I really hate when people throw this kind of gobbledygook at me so let me see if I can explain in pieces:

^/mypage/ = only consider the match if /mypage/ is at the start of a line…
.* = match everything that comes next until…
\.htm$ = you get to the last real period followed by htm and it’s at the end of a line.
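
If it helps, the three pieces can be wrapped in parentheses so you can see what each one grabs. This is only a sketch in Python, and the URL /mypage/report.v2.htm is a made-up example:

```python
import re

# Same expression as above, with each piece in its own group.
pieces = re.compile(r"^(/mypage/)(.*)(\.htm)$")

m = pieces.search("/mypage/report.v2.htm")
start, middle, ending = m.groups()

print(start)   # /mypage/
print(middle)  # report.v2
print(ending)  # .htm
```

The .* swallows the first period along with everything else, and only the last period is left for the \.htm at the end.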

Clear as mud, eh?

Robbin


About Robbin Steif

Our owner and CEO, Robbin Steif, started LunaMetrics ten years ago. She is a graduate of Harvard College and the Harvard Business School, and has served on the Board of Directors for the Digital Analytics Association. Robbin is a recent winner of a BusinessWomen First award, as well as a Diamond Award for business leadership.

http://www.lunametrics.com/blog/2006/12/02/regular-expressons-part-xii-bad-greed/

7 Responses to “Regular Expressions Part XII: Bad Greed”

Anonymous says:

^/mypage/.*\.htm$

Won’t just do what you’re after. This is where the .* construct can be too greedy. :-)

As this will also match:
/mypage/otherfolder/index.htm

Which I suspect from your description is not what you’re after?

If you want only .htm files in the mypage directory, you need to construct the RegEx to stop from including any other directories.
How are directories identified? By slashes. So how do we not get a directory? By not having one or more slashes.

So the extract becomes:
/[^/]+\.htm$

/ – a directory (in our case to be prefaced by ^/mypage )

[^/] – Any *single* character, but NOT a slash.

+ – One or more of the previous. Unless you really expect to have a file called “.htm”, always expect at least one character. “a.htm”.

Rest is as you’ve described.

So the full RegEx becomes:

^/mypage/[^/]+\.htm$
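
For comparison, a rough Python sketch that runs both versions side by side (the variable names are made up, and Python's re module is assumed to agree with Google Analytics on these simple patterns):

```python
import re

with_dot_star = re.compile(r"^/mypage/.*\.htm$")
no_subdirs = re.compile(r"^/mypage/[^/]+\.htm$")

urls = ["/mypage/index.htm", "/mypage/otherfolder/index.htm", "/mypage/.htm"]

# For each URL: (matches the .* version, matches the [^/]+ version)
comparison = {
    url: (with_dot_star.search(url) is not None, no_subdirs.search(url) is not None)
    for url in urls
}
for url, (loose, strict) in comparison.items():
    print(url, loose, strict)
```

Both accept /mypage/index.htm. Only the .* version accepts the subfolder URL and the bare /mypage/.htm.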


The trick with this style of RegEx construction is to try and figure out what *else* you could match, inadvertently. This is one of the reasons why I try really hard not to use “.*”, preferring “.+”, as the zero case of “.*” can be unexpected.

HTH?

- Steve, GoobleDeGook Provider Extraordinaire

Hmm… well, I got what I wanted, all the pages that start with /mypage and end with .htm.

Having said that, you allude to a great point — the ability of the RegEx to do things you don’t expect. I think that RegExperts like you are rare and should be worshipped. I am certainly one of your RegEx groupies. And always your RegEx student.

Robbin

Anonymous says:

Oh don’t get me wrong, your RegEx will get what you’re after, it’s the fact that it *can* get more that makes it too greedy.

As we’ve discussed privately – context is *everything*. :-)

If your website only ever has a single level directory tree, there is no problem.
If you don’t care about intervening directories – ditto.

When I’m trying out a more complex RegEx, I’ll usually create a small set (10-20) of test cases to see what matches and what doesn’t. Try and cover all possibilities of what should and shouldn’t match, and see if I got it right. I usually don’t. :-)

So for your original case I’d try:

/mypage/index.htm (Y)
/mypage/index.html (N)
/MyPage/Index.HTM (Y?)
/otherpage/index.htm (N)
/mypage/other.htm (Y)
mypage/other.htm (N)
other.htm (N)
other.gif (N)
/mypage/index.jpg (N)
/mypage/lower/index.htm (N?)
/mypage (N)
/mypage/ (N)
/mypage/htm (N)

/mypage/index.htm?id=1234 (N, but should it?)


where the bracketed Y or N is not included in the test data, but shows what I’d expect to see match or not.
Some of these are self evident “Won’t Work”, but it’s always good to verify. I speak from shoddy RegEx self-construction experience. :-)
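
A list like this is easy to run automatically. Below is a minimal sketch in Python, assuming the stricter ^/mypage/[^/]+\.htm$ pattern and a case-sensitive match. The "?" cases are recorded with the answer that pattern actually gives, which is an assumption rather than a decision about what the site needs.

```python
import re

pattern = re.compile(r"^/mypage/[^/]+\.htm$")

# (url, expected match) pairs taken from the list above.
cases = [
    ("/mypage/index.htm", True),
    ("/mypage/index.html", False),
    ("/MyPage/Index.HTM", False),        # "Y?" above; False when case matters
    ("/otherpage/index.htm", False),
    ("/mypage/other.htm", True),
    ("mypage/other.htm", False),
    ("other.htm", False),
    ("other.gif", False),
    ("/mypage/index.jpg", False),
    ("/mypage/lower/index.htm", False),  # "N?" above
    ("/mypage", False),
    ("/mypage/", False),
    ("/mypage/htm", False),
    ("/mypage/index.htm?id=1234", False),
]

# Collect every case where the pattern disagrees with the expectation.
failures = [(url, want) for url, want in cases
            if (pattern.search(url) is not None) != want]
print(failures or "all cases behave as expected")

# With case ignored, the mixed-case URL matches after all.
print(re.search(pattern.pattern, "/MyPage/Index.HTM", re.IGNORECASE) is not None)
```

An empty failures list means the pattern and the expectations agree; anything left in it is a case to look at again.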


What you don’t match is just as important as what you do.

- Steve melting in a hot Aussie summer

This is a great way of doing it. BTW, for a custom profile filter in GA, you get the option of telling the filter if case matters.

It is freezing here in Chicago, I would take Australia in a NY minute.

Robbin

Nikki says:

Thank you!!! I’ve been struggling with Google Analytics, trying to come up with an expression that matched 1 term at the beginning of the URL and another term at the end. Problem solved :-) I just wish I’d found your site and its clear examples months ago…


james says:

I have been searching for this information for TOO LONG. thank you so so much…