
Regular Expressions for GA, Bonus II: Minimal Matching





Here it is … a straggling installment in my year-long saga to bring Regular Expressions (RegEx) to regular people. Today, I want to write about how to make your expressions less greedy. It’s a feature called “minimal matching.” Unfortunately for you advanced learners, you have to wait until the end of the article to see the application. I always want to teach with the easiest of examples (however meaningless they may be…)

Last winter, I wrote about how RegEx are greedy. But Google Analytics uses a flavor of RegEx that allows you to make them less greedy.

Let’s start with a fairly stupid but easy to understand example. We’ll assume that the phrase we want to match to is baaaaa. We could create a regular expression like this:

ba+

That means, start matching at the b, and then match one or more a’s. (Right? The plus sign means, match one or more of the previous expression, which is so often just the previous character.) So this RegEx that I wrote, ba+, will match the entire string:

baaaaa

(If that doesn’t make sense to you, remember that RegEx are greedy, and by default they match as much as they possibly can. Here, the expression will match all the a’s we give it, since the plus sign means one or more.)
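
Here is a minimal sketch of that greedy behavior. It uses Python's re module as a stand-in for the Google Analytics engine (an assumption on my part; both accept this syntax, but this is not GA itself):

```python
import re

# Greedy by default: the plus sign grabs every "a" it can find.
match = re.search(r"ba+", "baaaaa")
greedy_result = match.group(0)

print(greedy_result)  # baaaaa
```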

But what if we wanted to match just the first b and the first a, like this: ba

We need to specify minimal matching (and hang in there with me, soon it will all be worthwhile.) We can match to just the first instance of a by using a question mark. Like this:

ba+?

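When Justin taught this to me (months ago), I asked, “How does the RegEx know that the question mark is doing minimal matching, and not there to say, match zero or one of the last character?” Well, it turns out that position decides: a question mark right after a character or group means “zero or one,” but a question mark right after a quantifier (a plus sign, a star, braces or another question mark) switches that quantifier to minimal matching. So the question mark in the expression above tells the plus sign, “Match as few a’s as you can get away with.” Since nothing follows the plus sign in this expression, that is exactly one a.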
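
To see what the trailing question mark does in practice, here is a small sketch, again using Python's re module as a stand-in for GA (an assumption). It also compares ba+? with ba? (the ordinary zero-or-one question mark), since the two are easy to confuse:

```python
import re

text = "baaaaa"

greedy = re.search(r"ba+", text).group(0)     # plus sign takes all the a's
lazy = re.search(r"ba+?", text).group(0)      # plus sign takes as few a's as possible
optional = re.search(r"ba?", text).group(0)   # zero or one "a"

print(greedy)    # baaaaa
print(lazy)      # ba
print(optional)  # ba

# The two question marks differ when there is no "a" at all:
lazy_on_b = re.search(r"ba+?", "b")               # None, still needs at least one "a"
optional_on_b = re.search(r"ba?", "b").group(0)   # "b", zero a's is allowed

# "Minimal" means as few as possible while still letting the rest match:
lazy_then_c = re.search(r"ba+?c", "baaac").group(0)
print(lazy_then_c)  # baaac
```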

This concept is actually very helpful when you are matching to URI strings. For example, let’s say you had a string like this:

gibberish/foldertwo/folderthree

And you wanted to rewrite that gibberish to be something meaningful. But because it is gibberish, it looks different every time. You could write a RegEx like this:

.*/

But it will match everything up to and including the last slash (gibberish/foldertwo/), because RegEx are greedy. However, if you use minimal matching, like this:

.*?/

it will match only up to the first slash (gibberish/). And that’s what minimal matching is all about.
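
The two URI patterns can be compared side by side with a short sketch. Python's re module stands in for GA here, and the replacement text "products/" is a made-up name used only for illustration:

```python
import re

uri = "gibberish/foldertwo/folderthree"

# Greedy: runs all the way to the last slash.
greedy = re.match(r".*/", uri).group(0)

# Minimal: stops at the first slash.
lazy = re.match(r".*?/", uri).group(0)

print(greedy)  # gibberish/foldertwo/
print(lazy)    # gibberish/

# Hypothetical rewrite: swap the gibberish folder for a meaningful name.
rewritten = re.sub(r"^.*?/", "products/", uri)
print(rewritten)  # products/foldertwo/folderthree
```

The ^ in the rewrite is an extra safeguard I added so that only the leading folder is replaced.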


Robbin Steif

About Robbin Steif

Our owner and CEO, Robbin Steif, started LunaMetrics ten years ago. She is a graduate of Harvard College and the Harvard Business School, and has served on the Board of Directors for the Digital Analytics Association. Robbin is a recent winner of a BusinessWomen First award, as well as a Diamond Award for business leadership.

http://www.lunametrics.com/blog/2007/07/29/regular-expressions-for-ga-bonus-ii-minimal-matching/

12 Responses to “Regular Expressions for GA, Bonus II: Minimal Matching”

Alan says:

Hey Robbin,

Hope you’re keeping well. Sorry I’ve been really bad at keeping in touch. When I read this post in the Paris Metro this morning I thought now might be as good a time as any to show a sign of life again :)

This is a really cool RegEx. I had no idea this worked. I would normally have matched to the first slash by creating a range and excluding the slash like so: [^/]*
However, it’s definitely easier your way!

Thanks and take care!
Alan

Robbin says:

Hi Alan! It’s good to know you are still there, riding the Paris Metro.

This works because Google uses Perl Compatible Regular Expressions (PCRE). PCRE actually has other capabilities, but I haven’t taken the time to explore them, just this one. I look to a reader in Australia, Steve, to make those comments and keep me in line when it comes to RegEx. So maybe we’ll hear from him.

steve says:

To prove the adage: “No matter how much you know about a topic, there’s always something new to learn”: As I replied privately to Robbin this morning (“Subtle. Really Subtle,” was the lead in. ;-) ), I only found out about this style of RegEx myself late last year or early this. I’ve been doing RegEx’s since ’88!
A friend in Austria (ie. next to Germany) used one to solve a funky problem we had. I was like “How on earth does that work? And what’s with the funky ‘.*?’ construct!?!!?!?”

Chapter 4 of the O’Reilly “Mastering Regular Expressions” (MRE) has a … deep explanation of these “lazy” expressions as Jeff refers to them. “Greedy” or “Lazy”. Sigh. IT people and the art of *bad* punning. ;-)

Alan, lazy expressions are, in my experience, generally slower than greedy ones. I have a program that does a lot of matching using the actual PCRE library[1]. Lots of ‘[^ ]+ ‘ (left_square not space right_square plus space) style of thing. Replacing with ‘.+? ‘ (dot plus question space) to get the equivalent lazy?
Go from ~ 80,000 lines/sec to ~65,000 lines/sec. If you read on to Chapter 6 in MRE, Jeff explains the whys and hows of this observed slowdown.
Without going into the detail, lazy expressions can cause the underlying engine to do more work. Of course, by “tomorrow” there may be new and improved optimisations that render the previous statements incorrect. ;-) [2]

Now as far as GA is concerned? Super speedy regex’s aren’t something we really need to worry about. Google do, we don’t. Yet. ;-) In that situation, I recommend going for what you find easiest to *understand*.

Do be aware that: “.*?/” vs ‘[^/]*’ will give different answers! They are not identical as written. What you probably want to compare with is: ‘[^/]*/’.
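
A quick way to check that difference, sketched with Python's re module as a stand-in for PCRE (an assumption; the sample string is the one from the article):

```python
import re

uri = "gibberish/foldertwo/folderthree"

lazy = re.match(r".*?/", uri).group(0)             # includes the slash
negated = re.match(r"[^/]*", uri).group(0)         # stops before the slash
negated_slash = re.match(r"[^/]*/", uri).group(0)  # includes the slash

print(lazy)           # gibberish/
print(negated)        # gibberish
print(negated_slash)  # gibberish/
```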

Cheers!
- Steve

[1] http://www.pcre.org. Has links to the Perl RegEx docco. Which could be a bit … hairy for many readers. And very perl specific.
[2] The optimisation we use is to have two regex’s. One, using greedy, is for most of the work. If that fails, we switch to the slightly more complex, plus lazy usage to try and match a 2nd time. Thus we get the best of both worlds. Speed and Correctness. It’s possible that Justin (who knows way more about GA ins and outs than me) may know of a way to expose and use such a double hit filter????

Robbin says:

Steve, you couldn’t know this but Alan is quite the RegEx expert (and as you point out, we always find new things about RegEx.) Also — Steve, we wouldn’t want to try to match to the greedy expression and then to the less greedy expression. That would usually defeat our project. It might be okay if we were matching for keywords, but if you were rewriting uris, it would be pretty important that you do it right.

So let Google worry that these are slightly slow. People used to say, don’t use a star * because it slows down the processing. And does anyone listen? No. And their processing is just fine (well, if it isn’t fine, it’s not because of their choice of RegEx…)

steve says:

“Steve, you couldn’t know this but Alan is quite the RegEx expert”
Ha! Translation: You’ve been “Trying to teach grandma how to suck eggs”. Oops. :-)

My apologies Alan. Please take my prior in the positive light of good intent, and not as the ravings of a pompous windbag. No matter how accurate the latter may be. ;-)

Cheers!
- Steve

This is a great article … I am a RegEx addict … Every time I hear about someone doing something powerful with RegEx … I get excited … yes … I am a RegEx geek …

Nice post …

Edward

Robbin says:

Hi Edward. I am glad you like this one, it is my favorite. I go back to this article over and over again. Robbin

Nuttakorn says:

What do you think if I want to filter out a page, like in this case:

http://www.domainname.com/Category/default.aspx (filter out this page)

But keep tracking all the pages under http://www.domainname.com/Category/Product1, http://www.domainname.com/Category/Product1 , etc.

Robbin says:

Nuttakorn – you can exclude this page:

/Category/default\.aspx

That way, Category/Product1 (etc) will still get included. The exclude I did above is very specific and doesn’t encompass the other pages that you want included.
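
As a rough sketch of how that exclude pattern behaves, here it is in Python's re module. This only imitates a GA exclude filter (an assumption), and the page list, including Product2, is made up for illustration:

```python
import re

# The backslash makes the dot literal, so only a real ".aspx" matches.
exclude = re.compile(r"/Category/default\.aspx")

# Hypothetical request URIs.
pages = ["/Category/default.aspx", "/Category/Product1", "/Category/Product2"]

kept = [page for page in pages if not exclude.search(page)]
print(kept)  # ['/Category/Product1', '/Category/Product2']

# Without the backslash, the dot would match any character:
loose_hit = re.search(r"/Category/default.aspx", "/Category/defaultXaspx") is not None
strict_hit = exclude.search("/Category/defaultXaspx") is not None
print(loose_hit, strict_hit)  # True False
```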

Alexander says:

Hello Allen, great article.

However, ba+? could be written much more simply to match this pattern.

Why not just use ba, which will match only one a after the b? The ba+? version looks like it might work, but it is confusing.

Is the ba+? formulation needed because GA RegEx is different, or is this your way of matching a pattern of ba?

Thank you in advance for your reply.

Alexander

Robbin says:

Alexander, ba will match ba and baby and bath and all sorts of other things. There are other ways to stop “RegEx greed,” like ba$, but I was just using that to illustrate a concept of minimal matching.
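
A small sketch of that point, using Python's re module as a stand-in for GA (an assumption). The word list is made up, and the ^ba$ pattern is an extra I added to show an exact match:

```python
import re

words = ["ba", "baby", "bath", "tuba"]

# Unanchored: matches anything that contains "ba".
unanchored = [w for w in words if re.search(r"ba", w)]

# Anchored at the end: the string must finish with "ba".
end_anchored = [w for w in words if re.search(r"ba$", w)]

# Anchored at both ends: the string must be exactly "ba".
exact = [w for w in words if re.search(r"^ba$", w)]

print(unanchored)    # ['ba', 'baby', 'bath', 'tuba']
print(end_anchored)  # ['ba', 'tuba']
print(exact)         # ['ba']
```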

Robbin

Jennifer says:

Ha, I came here expecting this to have something to do with baby bath. Turns out it was just the comment above by Robbin that must have triggered it.