Regular Expressions for GA, Bonus II: Minimal Matching
July 29, 2007
Here it is … a straggling installment in my year-long saga to bring Regular Expressions (RegEx) to regular people. Today, I want to write about how to make your expressions less greedy. It’s a feature called “minimal matching.” Unfortunately for you advanced learners, you have to wait until the end of the article to see the application. I always want to teach with the easiest of examples (however meaningless they may be…)
Last winter, I wrote about how RegEx are greedy. But Google Analytics uses a flavor of RegEx that allows you to make them less greedy.
Let’s start with a fairly stupid but easy-to-understand example. We’ll assume that the phrase we want to match is baaaaa. We could create a regular expression like this:
ba+
That means, start matching at the b, and then match one or more a’s. (Right? The plus sign means, match one or more of the previous expression, which is so often just the previous character.) So this RegEx that I wrote, ba+, will match the entire string:
baaaaa
(If that doesn’t make sense to you, remember that RegEx are greedy, and the default is, they match as much as they possibly can. Here, the expression will match all the a’s we give it, since the plus sign means, one or more.)
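If you want to see the greed for yourself, here is a minimal sketch. It uses Python’s re module as a stand-in, which is my assumption and not anything Google Analytics runs, but the plus sign behaves the same way in both for this simple case.

```python
import re

text = "baaaaa"

# Greedy by default: the plus sign grabs as many a's as it can.
greedy_match = re.search(r"ba+", text).group(0)

print(greedy_match)  # baaaaa
```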
But what if we wanted to match just the first b and the first a, like this: ba
We need to specify minimal matching (and hang in there with me, soon it will all be worthwhile). We can match just the first instance of a by putting a question mark after the plus sign. Like this:
ba+?
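When Justin taught this to me (months ago), I asked, “How does the RegEx know that the question mark is doing minimal matching, and not there to say, match zero or one of the last character?” Well, it turns out that it depends on what comes right before the question mark. After an ordinary character, the question mark means “zero or one.” But when it comes right after a quantifier like the plus sign, it tells the plus sign, “Match as few a’s as you can get away with.” Since the plus sign still needs at least one a, the expression above matches one a and no more.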
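Here is a small sketch of the two jobs the question mark can do, again using Python’s re module as an assumed stand-in for the Google Analytics flavor. On baaaaa the two expressions happen to match the same text, but on a lone b they part ways, because the plus sign still insists on at least one a.

```python
import re

text = "baaaaa"

# Question mark after the plus sign: minimal matching, so one a is enough.
lazy_match = re.search(r"ba+?", text).group(0)

# Question mark after a plain character: zero or one of that character.
optional_match = re.search(r"ba?", text).group(0)

# With no a's at all, the two expressions behave differently.
lazy_on_b = re.search(r"ba+?", "b")               # no match, returns None
optional_on_b = re.search(r"ba?", "b").group(0)   # matches the b alone

print(lazy_match, optional_match, lazy_on_b, optional_on_b)
```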
This concept is actually very helpful when you are matching to URI strings. For example, let’s say you had a string like this:
And you wanted to rewrite that gibberish to be something meaningful. But because it is gibberish, it looks different every time. You could write a RegEx like this:
But, it will match to the last slash because the RegEx are greedy. However, if you use minimal matching, like this:
it will match only up to the first slash. And that’s what minimal matching is all about.
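To make the slash business concrete, here is a sketch with a made-up URI. The path /store/x9k2q7/checkout/confirm, both patterns, and the replacement text are purely hypothetical, invented for illustration, and Python’s re module is once more an assumed stand-in for the Google Analytics flavor.

```python
import re

# Hypothetical URI: "x9k2q7" plays the part of the gibberish that changes every time.
uri = "/store/x9k2q7/checkout/confirm"

# Greedy: .* runs as far as it can, so the match ends at the LAST slash.
greedy_match = re.search(r"^/store/.*/", uri).group(0)

# Minimal: .*? stops as soon as it can, so the match ends at the FIRST slash
# after the gibberish.
lazy_match = re.search(r"^/store/.*?/", uri).group(0)

# Rewriting only the gibberish segment into something meaningful.
rewritten = re.sub(r"^/store/.*?/", "/store/product/", uri)

print(greedy_match)  # /store/x9k2q7/checkout/
print(lazy_match)    # /store/x9k2q7/
print(rewritten)     # /store/product/checkout/confirm
```

With the greedy version, the rewrite would have swallowed the checkout part of the path too, which is exactly the problem minimal matching avoids.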