Regular Expressions for GA, Bonus III: Lookahead

I hope this will be the last installment, for some time to come, of my Regular Expressions (RegEx) for Google Analytics series. At the end of the post, I have finally threaded all the RegEx posts.

Tonight you can see some very cool (if you care) regular expressions for GA that look ahead and decide whether the match is allowed. Sort of a conditional match. There are two kinds of lookahead: negative lookahead (don’t match if …) and positive lookahead (only match if …). Like braces, this is a RegEx feature that works in Google Analytics, but there is no documentation on it.

Here’s an example that I had today. I was working with a GA site where they have membership, not customers. The site owners use the word members in some of their URIs, and they also use membermail and memberthis and memberthat and membertheother. And a whole lot of other memberexamples.

So let’s say that I needed to create a filter with a regular expression to include all those member URIs in GA, but I want to make sure that I don’t include membermail, since mail is in a special category in my marketing mix. Now we’re in a position to formulate the question really well: how do we include all the Request URIs that have the string member in them where that string is not followed by the string mail? In other words — don’t match member if it is followed by mail.

Steve gets the credit for this one. He suggested that GA might work with negative lookahead - the ability to combine regular expressions to say, “don’t match if it is followed by…” In our membermail example, the expression would be member(?!mail).

The opening parenthesis, followed by a question mark, tells the RegEx engine, “Watch out, lookahead coming.” The exclamation point says, “And it’s a negative.” Combined, they mean a negative lookahead - don’t match the first part of the string if the second part is there. Don’t match member if it is followed by mail.
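Here is a minimal sketch of that negative lookahead, run in Python's re module rather than in GA itself. The URIs are made up for illustration, and I am assuming that a simple pattern like this behaves the same way in GA's engine.

```python
import re

# Hypothetical Request URIs, invented for this example.
uris = [
    "/members/index.html",
    "/memberthis/page.html",
    "/membermail/inbox.html",
]

# Match "member" only where it is NOT immediately followed by "mail".
pattern = re.compile(r"member(?!mail)")

# Keep every URI in which the pattern finds a match somewhere.
included = [uri for uri in uris if pattern.search(uri)]
print(included)  # ['/members/index.html', '/memberthis/page.html']
```

The membermail URI drops out because the only member in it is followed by mail.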

GA also handles positive lookahead. So if we want to match only membermail and not memberthis or memberthat and all the other URIs with member, we can write our RegEx like this: member(?=mail) – in this case the open paren and the question mark do the same thing (“Watch out, lookahead coming,”) but the equal sign says, “And it’s a positive match.”
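The positive version can be sketched the same way. Again this is Python's re module standing in for GA, with made-up URIs, so treat it as an illustration and not as a GA filter you can paste in.

```python
import re

# Hypothetical Request URIs, invented for this example.
uris = [
    "/members/index.html",
    "/memberthis/page.html",
    "/membermail/inbox.html",
]

# Match "member" only where it IS immediately followed by "mail".
pattern = re.compile(r"member(?=mail)")

only_mail = [uri for uri in uris if pattern.search(uri)]
print(only_mail)  # ['/membermail/inbox.html']
```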

There is one last little fine point to wrap your head around, assuming you are not dizzy already. The lookahead string, or whatever you want to call the string mail in my example, is not part of the match. I know this sounds like gibberish, so let me give a last example. This one’s for all you RegEx fans. And for everyone on the Paris metro:

Example: Let’s say that I am doing a positive lookahead like this: member(?=s)hip. This means, match member only if it is followed by an s, and then please match hip, too. However, the string membership would not be a match. That seems a little ridiculous. After all, it is member, followed by an s, followed by hip, right?

Well, it doesn’t work that way. That’s because the s is only a condition. In the eyes of the RegEx engine, you sort of have a conditional regex that looks like this:

memberhip (notice, no s)

And, you are trying to match to

membership

It’s not a match, because the s isn’t part of the RegEx. It was just being used as a condition.
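If you want to see this for yourself, here is a small sketch in Python's re module (an assumption on my part that GA's engine agrees, which it should for a pattern this simple). It shows that the s is checked but never consumed.

```python
import re

# The lookahead checks for "s" but does not consume it, so "hip" is
# compared against "ship" and the match fails.
no_match = re.search(r"member(?=s)hip", "membership")
print(no_match)  # None

# On its own, the lookahead succeeds, and the matched text stops at "member".
condition_only = re.search(r"member(?=s)", "membership")
print(condition_only.group())  # member

# To match the whole word, the "s" has to appear again after the lookahead.
whole_word = re.search(r"member(?=s)ship", "membership")
print(whole_word.group())  # membership
```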

OK, we are done for tonight, and

Late note: A reader here, Alan, wrote a wonderful comment, whereby he shows other ways to implement and use the power of lookahead. You should read his comment, but the short version is, you can use lookahead to match if the “condition” is somewhere down the line. The “lookahead string” (the condition) doesn’t have to *immediately* follow the matched text if you use the syntax that he figured out. (See, now you have to read his stuff….)
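As a rough sketch of what Alan describes, here are his two patterns run in Python's re module. The test strings are shortened stand-ins for his examples: the a=1 parameter is hypothetical and simply replaces the long run of parameters he elides.

```python
import re

# Putting .* inside the lookahead lets the condition sit anywhere later in the string.
anywhere = re.compile(r"member(?=.*s)hip")
print(bool(anywhere.search("memberhips")))  # True: an "s" appears later on
print(bool(anywhere.search("memberhip")))   # False: no "s" after "member"
print(bool(anywhere.search("membership")))  # False: "hip" must directly follow "member"

# Negative version: checkout URIs with step=purchase, unless the outcome is an error.
checkout = re.compile(
    r"/checkout\.php\?(?!.*finalOutcome=credit_card_error).*&step=purchase"
)
ok_uri = "/checkout.php?a=1&finalOutcome=credit_card_visa_ok&step=purchase"
error_uri = "/checkout.php?a=1&finalOutcome=credit_card_error&step=purchase"
print(bool(checkout.search(ok_uri)))     # True
print(bool(checkout.search(error_uri)))  # False
```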

Here is the thread with all the RegEx posts.

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets [] and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now we will Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

– Robbin

4 Responses to “Regular Expressions for GA, Bonus III: Lookahead”

  2. Alan Says:

    Reporting live from the Paris métro…
    (specifically: RER Ligne A, between Châtelet-les-Halles and Auber, for those who like the details)

    This is really powerful, Robbin. Every time I think that RegEx has no more secrets for me, you come up with something new on this blog.

    My fellow metro commuters (Serge, Jean-Pierre and Jean-François – just kidding) and myself were just debating about the last example you used: member(?=s)hip, and though it is understood that it will NOT match “membership”, we were wondering what it WOULD match, if anything…
    Personally, my money was on it matching “memberhips” (match “member” – as long as it is followed at some point further down the line by an ‘s’ – followed by “hip”). Alas, I’m wrong again, as is so often the case in these situations. I tested it in the RegEx Coach, and it failed miserably. As the RegEx is written, “member” needs to be immediately followed by “s”.

    However, I played around with it a bit further and found a way to make these lookaheads function in an “anywhere further down the line” kinda way. If we use a RegEx like the following:

    member(?=.*s)hip

    then it WILL match “memberhips” (or rather it will match “memberhip” so long as it is followed further in the string by an ‘s’).
    I really like the fact that the (?=.*s) leapfrogs the following element “hip” as it illustrates well what you were saying about it being a mere condition that qualifies “member” and NOT an element in the string that needs to be taken into account in the sequence.

    We could imagine the following URIs:
    /checkout.php?…6billion_parameters…&finalOutcome=credit_card_visa_ok&step=purchase
    /checkout.php?…6billion_parameters…&finalOutcome=mail_cheque_pending&step=purchase
    /checkout.php?…6billion_parameters…&finalOutcome=debit_card_mastercard&step=purchase
    /checkout.php?…6billion_parameters…&finalOutcome=credit_card_amex&step=purchase
    /checkout.php?…6billion_parameters…&finalOutcome=credit_card_error&step=purchase
    etc.

    In this instance let us imagine wanting to match all /checkout.php URIs with the parameter step=purchase, UNLESS the value of the “finalOutcome=” parameter is “credit_card_error”.
    We could simply use the following RegEx:

    /checkout\.php\?(?!.*finalOutcome=credit_card_error).*&step=purchase

    Read this as: match all URIs that contain “/checkout.php?” (so long as you can’t find anywhere in the string following “finalOutcome=credit_card_error”), followed by any text, followed by “&step=purchase”

    I can’t imagine the number of times I’ve had a need for these conditional lookaheads, but not being aware of their existence, I ended up with a ludicrously verbose and obscene RegEx instead.
    Thanks for furthering our knowledge once again :)

    Alan

  3. Robbin Says:

    Alan, this was so great that I edited the post — I am hoping that everyone else will read your comment, too.

    Robbin

    ps I really don’t get credit here. Steve gets the credit.

  4. Online Marketing Blog » Blog Archive » Funnel Problems in Google Analytics Says:

    [...] can avoid problems with a step matching subsequent steps by using regular expression that have negative lookaheads to exclude the later [...]
