
Regular Expressions for GA, Bonus II: Minimal Matching

Here it is … a straggling installment in my year-long saga to bring Regular Expressions (RegEx) to regular people. Today, I want to write about how to make your expressions less greedy. It’s a feature called “minimal matching.” Unfortunately for you advanced learners, you have to wait until the end of the article to see the application. I always want to teach with the easiest of examples (however meaningless they may be…).

Last winter, I wrote about how RegEx are greedy. But Google Analytics uses a flavor of RegEx that allows you to make them less greedy.

Let’s start with a fairly stupid but easy to understand example. We’ll assume that the phrase we want to match to is baaaaa. We could create a regular expression like this:

ba+

That means, start matching at the b, and then match one or more a’s. (Right? The plus sign means, match one or more of the previous expression, which is so often just the previous character.) So this RegEx that I wrote, ba+, will match the entire string:

baaaaa

(If that doesn’t make sense to you, remember that RegEx are greedy, and by default they match as much as they possibly can. Here, the expression will match all the a’s we give it, since the plus sign means one or more.)

But what if we wanted to match just the first b and the first a, like this: ba

We need to specify minimal matching (and hang in there with me, soon it will all be worthwhile.) We can match to just the first instance of a by using a question mark. Like this:

ba+?

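When Justin taught this to me (months ago), I asked, “How does the RegEx know that the question mark is doing minimal matching, and not there to say, match zero or one of the last character?” Well, it turns out that it depends on what comes right before the question mark. After an ordinary character, it means zero or one of that character. After a quantifier like the plus sign or the star, it means minimal matching: the question mark in the expression above tells the plus sign, “Match the one a you need, but no more than you have to.”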
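If you want to see the two meanings of the question mark side by side, here is a minimal sketch. It assumes Python’s built-in re module, which is not the engine Google Analytics runs but does support the same ? modifier, and the variable names are just for illustration.

```python
import re

text = "baaaaa"

# Plus sign alone: greedy, takes as many a's as it can.
greedy = re.match(r"ba+", text).group()
# Plus sign followed by ?: minimal matching, takes as few a's as it can.
lazy = re.match(r"ba+?", text).group()
# ? directly after a character: zero or one of that character.
optional = re.match(r"ba?", text).group()

print(greedy)    # baaaaa
print(lazy)      # ba
print(optional)  # ba

# The two uses of ? only look the same on this string.
# Take away the a's and they part ways.
matches_optional = re.match(r"ba?", "b") is not None
matches_lazy = re.match(r"ba+?", "b") is not None
print(matches_optional)  # True: zero a's is fine for ba?
print(matches_lazy)      # False: the plus sign still insists on one a
```

On baaaaa, both ba+? and ba? stop after the first a, but on a lone b only ba? still matches.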

This concept is actually very helpful when you are matching to URI strings. For example, let’s say you had a string like this:

gibberish/foldertwo/folderthree

And you wanted to rewrite that gibberish to be something meaningful. But because it is gibberish, it looks different every time. You could write a RegEx like this:

.*/

But it will match everything up to the last slash, because RegEx are greedy. However, if you use minimal matching, like this:

.*?/

it will match to only the first slash. And that’s what minimal matching is all about.
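Here is a small sketch of the same comparison, again assuming Python’s re module rather than Google Analytics itself. The replacement text products/ is a made-up name, standing in for whatever meaningful label you would rewrite the gibberish to.

```python
import re

uri = "gibberish/foldertwo/folderthree"

# Greedy: the star grabs as much as it can, so the match ends at the last slash.
greedy = re.match(r".*/", uri).group()
# Minimal: the star grabs as little as it can, so the match ends at the first slash.
lazy = re.match(r".*?/", uri).group()

print(greedy)  # gibberish/foldertwo/
print(lazy)    # gibberish/

# Hypothetical rewrite: swap only the gibberish for a readable folder name.
rewritten = re.sub(r"^.*?/", "products/", uri)
print(rewritten)  # products/foldertwo/folderthree
```

With the greedy version in the rewrite, foldertwo would be swallowed along with the gibberish, which is exactly what you don’t want.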

Plus signs +
Backslashes \
Dots .
Carets ^
Dollar signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets [] and dashes -
Stars *
Regular Expressions for Google Analytics: Now we will Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching
Lookahead 

Robbin

5 Responses to “Regular Expressions for GA, Bonus II: Minimal Matching”

  1. Alan Says:

    Hey Robbin,

    Hope you’re keeping well! Sorry I’ve been really bad at keeping in touch. When I read this post in the Paris Metro this morning I thought now might be as good a time as any to show a sign of life again :)

    This is a really cool RegEx. I had no idea this worked. I would normally have matched to the first slash by creating a range and excluding the slash like so: [^/]*
    However, it’s definitely easier your way!

    Thanks and take care!
    Alan

  2. Robbin Says:
    Hi Alan! It’s good to know you are still there, riding the Paris Metro.

    This is because Google uses Perl Compatible Regular Expressions (PCRE). PCRE actually has other capabilities, but I haven’t taken the time to explore them, just this one. I look to a reader in Australia, Steve, to make those comments and keep me in line when it comes to RegEx. So maybe we’ll hear from him.

  3. steve Says:

    To prove the adage: “No matter how much you know about a topic, there’s always something new to learn”: As I replied privately to Robbin this morning (“Subtle. Really Subtle,” was the lead in. ;-) ), I only found out about this style of RegEx myself late last year or early this. I’ve been doing RegEx’s since ’88!
    A friend in Austria (ie. next to Germany) used one to solve a funky problem we had. I was like “How on earth does that work? And what’s with the funky ‘.*?’ construct!?!!?!?”

    Chapter 4 of the O’Reilly “Mastering Regular Expressions” (MRE) has a … deep explanation of these “lazy” expressions as Jeff refers to them. “Greedy” or “Lazy”. Sigh. IT people and the art of *bad* punning. ;-)

    Alan, lazy expressions are, in my experience, generally slower than greedy ones. I have a program that does a lot of matching using the actual PCRE library[1]. Lots of ‘[^ ]+ ‘ (left_square not space right_square plus space) style of thing. Replacing with ‘.+? ‘ (dot plus question space) to get the equivalent lazy?
    Go from ~ 80,000 lines/sec to ~65,000 lines/sec. If you read on to Chapter 6 in MRE, Jeff explains the whys and hows of this observed slowdown.
    Without going into the detail, lazy expressions can cause the underlying engine to do more work. Of course, by “tomorrow” there may be new and improved optimisations that render the previous statements incorrect. ;-) [2]

    Now as far as GA is concerned? Super speedy regex’s aren’t something we really need to worry about. Google do, we don’t. Yet. ;-) In that situation, I recommend going for what you find easiest to *understand*.

    Do be aware that: “.*?/” vs ‘[^/]*’ will give different answers! They are not identical as written. What you probably want to compare with is: ‘[^/]*/’.

    Cheers!
    - Steve

    [1] http://www.pcre.org. Has links to the Perl RegEx docco. Which could be a bit … hairy for many readers. And very perl specific.
    [2] The optimisation we use is to have two regex’s. One, using greedy, is for most of the work. If that fails, we switch to the slightly more complex, plus lazy usage to try and match a 2nd time. Thus we get the best of both worlds. Speed and Correctness. It’s possible that Justin (who knows way more about GA ins and outs than me) may know of a way to expose and use such a double hit filter????

  4. Robbin Says:
    Steve, you couldn’t know this but Alan is quite the RegEx expert (and as you point out, we always find new things about RegEx.) Also, Steve, we wouldn’t want to try to match to the greedy expression and then to the less greedy expression. That would usually defeat our project. It might be okay if we were matching for keywords, but if you were rewriting URIs, it would be pretty important that you do it right.

    So let Google worry that these are slightly slow. People used to say, don’t use a star * because it slows down the processing. And does anyone listen? No. And their processing is just fine (well, if it isn’t fine, it’s not because of their choice of RegEx…)

  5. steve Says:

    “Steve, you couldn’t know this but Alan is quite the RegEx expert”
    Ha! Translation: You’ve been “Trying to teach grandma how to suck eggs”. Oops. :-)

    My apologies Alan. Please take my prior in the positive light of good intent, and not as the ravings of a pompous windbag. No matter how accurate the latter may be. ;-)

    Cheers!
    - Steve
