Regular Expressions Part XII: Bad Greed

By Robbin Steif

December 2, 2006

Now that I have learned and explained the Regular Expressions that Google Analytics uses:

Backslashes \
Dots .
Carets ^
Dollar signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets [] and dashes -
Plus signs +
Stars *

I want to explore another area: Regular Expressions and the concept of greediness.

You might be tempted to write a Regular Expression like this:

/mypage/

expecting it to match the page on your site called /mypage/

And this Regular Expression really does match /mypage/. But it also matches /mypage/thirdpage-and-something-else. For that matter, it matches /secondpage/mypage/ and /secondpage/mypage/index.htm, because nothing in it says where the match has to start or stop.

Regular Expressions are greedy: they match and match as much as they can. Greed can be good, but first I want to write about the obvious problem: the RegEx (Regular Expression) will match too many strings to be useful.
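A quick way to see this is to try the pattern outside of Google Analytics. The snippet below is only a sketch: it assumes Python's re.search is a fair stand-in for the "match anywhere in the string" behavior described here, and the URIs are made up for illustration.

```python
import re

# The unanchored pattern from the text: no ^ and no $.
pattern = re.compile(r"/mypage/")

# Hypothetical request URIs, invented for this example.
uris = [
    "/mypage/",
    "/mypage/thirdpage-and-something-else",
    "/secondpage/mypage/",
]

# re.search succeeds if the pattern occurs anywhere in the string.
matches = {uri: bool(pattern.search(uri)) for uri in uris}

for uri, hit in matches.items():
    print(uri, "->", "match" if hit else "no match")
```

All three URIs match, including the one where /mypage/ is buried inside another folder.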

We can deal with this in various ways:

1) Tell the RegEx where to start. In the above example, if we write the RegEx like this

^/mypage/

it will only match when /mypage/ is at the beginning of the line, so it will never match /secondpage/mypage/ etc.
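Here is a small check of the starting anchor, again as a sketch in Python rather than in Google Analytics itself. The two URIs are hypothetical.

```python
import re

# ^ pins the match to the start of the string.
anchored = re.compile(r"^/mypage/")

# Hypothetical URIs for illustration.
starts_with = bool(anchored.search("/mypage/otherpages.htm"))
nested = bool(anchored.search("/secondpage/mypage/"))

print(starts_with, nested)  # True False
```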

2) Tell the RegEx when to stop. We can do this in various ways, but need to know how the expression ends. For example, are we only looking for mypage.htm or are we also looking for all the pages that are in the /mypage/ folder — /mypage/otherpages.htm? If only mypage.htm matters, then we can include that in the RegEx:

/mypage\.htm$

Notice that I used a backslash to make the dot into a real dot and not a special character, and a dollar sign to say, this only works at the end of the line. (That way, mypage.html won’t match.)

We can combine that with #1 above and create this RegEx:

^/mypage\.htm$

and never get any unexpected characters before the slash or after the htm.
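To see what the backslash and the dollar sign each buy you, here is a sketch comparing the escaped pattern with a version where the dot is left unescaped. It assumes Python's re module behaves like the engine in Google Analytics for these simple cases, and the URIs (including the odd /mypagexhtm) are invented for the test.

```python
import re

exact = re.compile(r"^/mypage\.htm$")  # backslash: a real dot
loose = re.compile(r"^/mypage.htm$")   # bare dot: any single character

# Hypothetical URIs for illustration.
uris = [
    "/mypage.htm",
    "/mypage.html",
    "/mypagexhtm",
    "/secondpage/mypage.htm",
]

exact_hits = [u for u in uris if exact.search(u)]
loose_hits = [u for u in uris if loose.search(u)]

print("exact:", exact_hits)
print("loose:", loose_hits)
```

The escaped pattern matches only /mypage.htm, while the unescaped one also lets /mypagexhtm through.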

On the other hand, if all the pages in the /mypage/ folder that have .htm suffixes are of interest, we could do this differently:

^/mypage/.*\.htm$

I really hate when people throw this kind of gobbledygook at me so let me see if I can explain in pieces:

^/mypage/ = only consider the match if /mypage/ is at the start of a line…
.* = match everything that comes next until…
\.htm$ = you get to the last real period followed by htm and it’s at the end of a line.
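The three pieces can be tried out together. This is a sketch in Python with made-up URIs. The parentheses around .* are my addition, there only so the code can print what the greedy middle part swallowed.

```python
import re

# Same pattern as above, with a capture group added around .* for inspection.
folder_htm = re.compile(r"^/mypage/(.*)\.htm$")

# Hypothetical URIs for illustration.
uris = [
    "/mypage/otherpages.htm",
    "/mypage/otherpages.html",
    "/secondpage/mypage/x.htm",
]
hits = [u for u in uris if folder_htm.search(u)]
print(hits)

# Greedy .* runs all the way to the LAST ".htm" in the string.
captured = folder_htm.search("/mypage/a.htm.htm").group(1)
print(captured)  # a.htm
```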

Clear as mud, eh?

Robbin

Our owner and CEO, Robbin Steif, started LunaMetrics ten years ago. She is a graduate of Harvard College and the Harvard Business School, and has served on the Board of Directors for the Digital Analytics Association. Robbin is a winner of a BusinessWomen First award, as well as a recent Diamond Award for business leadership. You should read her letter before you decide to work with us.

  • Anonymous

    ^/mypage/.*\.htm$

    Won’t just do what you’re after. This is where the .* construct can be too greedy. :-)

    As this will also match:
    /mypage/otherfolder/index.htm

    Which I suspect from your description is not what you’re after?

    If you want only .htm files in the mypage directory, you need to construct the RegEx to stop from including any other directories.
    How are directories identified? By slashes. So how do we not get a directory? By not having one or more slashes.

    So the extract becomes:
    /[^/]+\.htm$

    / – a directory (in our case to be prefaced by ^/mypage )

    [^/] – Any *single* character, but NOT a slash.

    + – One or more of the previous. Unless you really expect to have a file called “.htm”, always expect at least one character. “a.htm”.

    Rest is as you’ve described.

    So the full RegEx becomes:

    ^/mypage/[^/]+\.htm$

    The trick with this style of RegEx construction is to try and figure out what *else* you could match, inadvertently. This is one of the reasons why I try really hard not to use “.*”, preferring “.+”, as the zero case of “.*” can be unexpected.

    HTH?

    – Steve, GoobleDeGook Provider Extraordinaire

  • http://www.lunametrics.com/blog LunaMetrics Blog

    Hmm… well, I got what I wanted, all the pages that start with /mypage and end with .htm.

    Having said that, you allude to a great point — the ability of the RegEx to do things you don’t expect. I think that RegExperts like you are rare and should be worshipped. I am certainly one of your RegEx groupies. And always your RegEx student.

    Robbin

  • Anonymous

    Oh don’t get me wrong, your RegEx will get what you’re after, it’s the fact that it *can* get more that makes it too greedy.

    As we’ve discussed privately – context is *everything*. :-)

    If your website only ever has a single level directory tree, there is no problem.
    If you don’t care about intervening directories – ditto.

    When I’m trying out a more complex RegEx, I’ll usually create a small set (10-20) of test cases to see what matches and what doesn’t. Try and cover all possibilities of what should and shouldn’t match, and see if I got it right. I usually don’t. :-)

    So for your original case I’d try:

    /mypage/index.htm (Y)
    /mypage/index.html (N)
    /MyPage/Index.HTM (Y?)
    /otherpage/index.htm (N)
    /mypage/other.htm (Y)
    mypage/other.htm (N)
    other.htm (N)
    other.gif (N)
    /mypage/index.jpg (N)
    /mypage/lower/index.htm (N?)
    /mypage (N)
    /mypage/ (N)
    /mypage/htm (N)

    /mypage/index.htm?id=1234 (N, but should it?)

    where the bracketed Y or N is not included in the test data, but shows what I’d expect to see match or not.
    Some of these are self evident “Won’t Work”, but it’s always good to verify. I speak from shoddy RegEx self-construction experience. :-)

    What you don’t match is just as important as what you do.

    – Steve melting in a hot Aussie summer

  • http://www.lunametrics.com/blog LunaMetrics Blog

    This is a great way of doing it. BTW, for a custom profile filter in GA, you get the option of telling the filter if case matters.

    It is freezing here in Chicago, I would take Australia in a NY minute.

    Robbin

  • http://Medicine.org Nikki

    Thank you!!! I’ve been struggling with Google Analytics, trying to come up with an expression that matched 1 term at the beginning of the URL and another term at the end. Problem solved :-) I just wish I’d found your site and its clear examples months ago…


  • http://www.sponsor121.com james

    I have been searching for this information for TOO LONG. thank you so so much…

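The test list suggested in the comments can be run mechanically. The sketch below feeds those same URIs to both patterns from the discussion, the one with .* and the one with [^/]+. It assumes Python's re module as a stand-in for the Google Analytics engine, and it is case sensitive by default, so /MyPage/Index.HTM will not match unless you pass re.IGNORECASE.

```python
import re

# The two patterns discussed in the post and in the comments.
greedy = re.compile(r"^/mypage/.*\.htm$")
one_level = re.compile(r"^/mypage/[^/]+\.htm$")

# Test URIs copied from the comment thread (the Y/N markers are dropped).
cases = [
    "/mypage/index.htm",
    "/mypage/index.html",
    "/MyPage/Index.HTM",
    "/otherpage/index.htm",
    "/mypage/other.htm",
    "mypage/other.htm",
    "other.htm",
    "other.gif",
    "/mypage/index.jpg",
    "/mypage/lower/index.htm",
    "/mypage",
    "/mypage/",
    "/mypage/htm",
    "/mypage/index.htm?id=1234",
]

greedy_hits = [c for c in cases if greedy.search(c)]
one_level_hits = [c for c in cases if one_level.search(c)]

print("with .*    :", greedy_hits)
print("with [^/]+ :", one_level_hits)
```

The only difference between the two result lists is /mypage/lower/index.htm, which the .* version accepts and the [^/]+ version rejects.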
