Regular Expressions for GA, Bonus III: Lookahead

NOTE: GA no longer supports negative lookahead. 

I hope this will be the last installment, for some time to come, of my Regular Expressions (RegEx) for Google Analytics series. At the end of the post, I have finally threaded all the RegEx posts.

Tonight you can see some very cool (if you care) regular expressions for GA that look ahead and decide whether the match is allowed: a sort of conditional match. There are two kinds of lookahead: negative lookahead (don’t match if …) and positive lookahead (only match if …). Like braces, this is a RegEx feature that works in Google Analytics, but there is no documentation on it.

Here’s an example that I had today. I was working with a GA site where they have membership, not customers. The site owners use the word members in some of their URIs, and they also use membermail and memberthis and memberthat and membertheother. And a whole lot of other memberexamples.

So let’s say that I needed to create a filter with a regular expression to include all those member URIs in GA, but I want to make sure that I don’t include membermail, since mail is in a special category in my marketing mix. Now we’re in a position to formulate the question really well: how do we include all the Request URIs that have the string member in them where that string is not followed by the string mail? In other words — don’t match member if it is followed by mail.

Steve gets the credit for this one. He suggested that GA might work with negative lookahead – the ability to combine regular expressions to say, “don’t match if it is followed by…” In our membermail example, the expression would be member(?!mail).

The opening of the parenthesis, followed by a question mark, tells the RegEx engine, “Watch out, lookahead coming.” The exclamation point says, “And it’s a negative.” Combined, they mean it’s a negative lookahead – don’t match the first part of the string if the second part is there. Don’t match member if it is followed by mail.
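Here is a minimal sketch of that negative lookahead, written in Python rather than GA. The assumption is that Python's `re` module treats `(?!...)` the same way the post describes; the URIs are made up for illustration and do not come from any real site.

```python
import re

# Hypothetical request URIs, invented for this illustration.
uris = ["/members/", "/membermail/inbox", "/memberthis/", "/about/"]

# "member" only when it is NOT immediately followed by "mail".
negative = re.compile(r"member(?!mail)")

# Keep every URI where the pattern is found somewhere in the string.
without_mail = [uri for uri in uris if negative.search(uri)]

print(without_mail)
```

The membermail URI drops out because its only occurrence of member is followed by mail; the /about/ URI drops out because it has no member at all.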

GA also handles positive lookahead. So if we want to match only membermail and not memberthis or memberthat and all the other URIs with member, we can write our RegEx like this: member(?=mail) – in this case the open paren and the question mark do the same thing (“Watch out, lookahead coming”), but the equal sign says, “And it’s a positive match.”
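The positive version can be sketched the same way. Again this is Python's `re` module standing in for GA, and the URIs are hypothetical.

```python
import re

# Hypothetical request URIs, invented for this illustration.
uris = ["/members/", "/membermail/inbox", "/memberthis/", "/about/"]

# "member" only when it IS immediately followed by "mail".
positive = re.compile(r"member(?=mail)")

# Keep every URI where the pattern is found somewhere in the string.
only_mail = [uri for uri in uris if positive.search(uri)]

print(only_mail)
```

Swapping the exclamation point for an equal sign flips the filter: only the membermail URI survives.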

There is one last little fine point to wrap your head around, assuming you are not dizzy already. The lookahead string, or whatever you want to call the string mail in my example, is not part of the match. I know this sounds like gibberish, so let me give a last example. This one’s for all you RegEx fans. And for everyone on the Paris metro:

Example: Let’s say that I am doing a positive lookahead like this: member(?=s)hip. This means: match member only if it is followed by an s, and then please match hip, too. However, the string membership would not be a match. That seems a little ridiculous. After all, it is member, followed by an s, followed by hip, right?

Well, it doesn’t work that way. That’s because the s is only a condition. In the eyes of the RegEx engine, you sort of have a conditional regex that looks like this:

memberhip (notice, no s)

And, you are trying to match to

membership

It’s not a match, because the s isn’t part of the RegEx. It was just being used as a condition.
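To see this for yourself outside GA, here is a small sketch using Python's `re` module (an assumption on my part that it is a fair stand-in; the behavior shown is standard lookahead behavior in Python). It tries the expression from the example, then two variations that show what the lookahead does and does not consume.

```python
import re

text = "membership"

# The example from the post: the s is checked but not consumed,
# so the engine then looks for "hip" right after "member" and finds "shi".
no_match = re.search(r"member(?=s)hip", text)

# The lookahead on its own: the match succeeds, but covers only "member".
condition_only = re.search(r"member(?=s)", text)

# To match the whole word, the s has to appear again as a real character.
whole_word = re.search(r"member(?=s)ship", text)

print(no_match)
print(condition_only.group(0))
print(whole_word.group(0))
```

The middle case is the useful one to remember: the match ends right after member, even though the s had to be there for the match to succeed.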

OK, we are done for tonight.

Late note: A reader here, Alan, wrote a wonderful comment, whereby he shows other ways to implement and use the power of positive lookahead. You should read his comment, but the short version is, you can use positive lookahead to match if the “condition” is somewhere down the line. The “lookahead string” (the condition) doesn’t have to *immediately* follow the (?= if you use the syntax that he figured out. (See, now you have to read his stuff….)
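Alan's exact syntax is not reproduced here, so the following is only one common way to get that effect, not necessarily his: put a wildcard inside the lookahead so the condition can sit anywhere later in the string. It is sketched in Python's `re` module with hypothetical URIs.

```python
import re

# Hypothetical request URIs, invented for this illustration.
uris = ["/member/settings/mail", "/membermail/", "/member/profile"]

# "member" only when "mail" appears somewhere after it, not necessarily right away.
# The .* inside the lookahead lets the condition skip over any characters first.
later = re.compile(r"member(?=.*mail)")

# Keep every URI where the pattern is found somewhere in the string.
mail_somewhere_after = [uri for uri in uris if later.search(uri)]

print(mail_somewhere_after)
```

Because `.*` can also match nothing, this pattern still accepts membermail; it just stops requiring that mail come immediately after member.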

Here is the thread with all the RegEx posts.

Backslashes \
Dots .
Carets ^
Dollar signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets [] and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now we will Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

– Robbin

About Robbin Steif

Our owner and CEO, Robbin Steif, is an analyst herself. She is a graduate of Harvard College and the Harvard Business School, and has served on the Board of Directors for the Digital Analytics Association (formerly the Web Analytics Association.)

http://www.lunametrics.com/blog/2007/08/08/regular-expressions-for-ga-bonus-iii-lookahead/

13 Responses to “Regular Expressions for GA, Bonus III: Lookahead”

Alan says:

Reporting live from the Paris métro…
(specifically: RER Ligne A, between Ch

{Note from Robbin: Many years later, I deleted Alan’s excellent comment. This was because a) GA no longer supports negative lookahead and b) the Microsoft characters and his French keyboard, despite his Scottish ancestry and US education, polluted the comment so badly that when asked to clean it up, I couldn’t even read it. Apologies. – Robbin}

Alan

Robbin says:

Alan, this was so great that I edited the post — I am hoping that everyone else will read your comment, too.

Robbin

ps I really don’t get credit here. Steve gets the credit.

[...] can avoid problems with a step matching subsequent steps by using regular expressions that have negative lookaheads to exclude the later [...]

[...] advanced regular [expressions] with negative assertions that “look ahead”, also offering a link for further reading. In fact, I think that in most cases one can make do with less advanced expressions [...]

Open SEO says:

Nice example very clear.
However, the Google Analytics team confirmed that they dropped support for negative lookahead in GA, because of some security risk, they say.

DarrenJames says:

Now that negative lookaheads in GA are no longer available, are there any workarounds for this?

hugh gage says:

I have tested this in the top content report filter (where I test all regex against live data before setting up funnels) and all I get is a message telling me “there was an error fetching data for this view”.

Thinkerati says:

Negative lookahead is no longer supported in Google Analytics from 2009 onwards, sadly.

Robbin says:

Yup, thinkerati, you are right. But it was supported way back when this was written…

Rick says:

What are the other options to negative lookaheads? I need to filter out any URL without one particular subdomain in it. The goal is achievable from everywhere on my site apart from one subdomain so I really need this :| ???

Mark C says:

Rick, I wish I knew. It seems negative lookahead is the most desired function (isn’t stripping stuff _out_ of a report the best way to focus your analytics activity?) and there is a strange silence on it from GA…

All4One_One4All says:

Fantastic articles on RegEx – been using RegEx for years, but learnt a lot, especially with better ways to ‘twig’ what it’s really doing.

The last article, under “12 Responses to ‘Regular Expressions for GA, Bonus III: Lookahead’” by “Allen”, is too corrupt to read. Any chance of a clean copy?

Some, but only a few, CAN teach. It doesn’t take fancy qualifications and one can tell whether someone has it in just seconds. Congratulations on the best RegEx article ever!