
Regular Expressions Part XII: Now Let’s Practice





Now that I have learned and then explained all the Regular Expressions for Google Analytics:

Backslashes \
Dots .
Carets ^
Dollar signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets [] and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

let’s work backwards, i.e. look at some expressions and figure out what they mean and why.


I always hate when techies give a really simple explanation and then jump to the hardest example possible, so I will try not to do the same (which is easy for me, not being a techie and all.) Let’s start with these warm-up examples from the Wikipedia entry on Regular Expressions.

  • “.at” matches any three-character string like hat, cat or bat. Reason: Because a dot matches any character. So hat, cat and bat are all good matches, as would be any other one character match to the dot.
  • “[hc]at” matches hat and cat. Reason: Because square brackets create a list of items, and you can match to any one item in the list. So this expression matches hat by pulling the “h” out of the square brackets, and it matches cat by pulling the “c” out of the square brackets, but unlike the previous example, it doesn’t match bat. That’s because there is no “b” in the square brackets.
  • “[^b]at” matches all the matched strings from the regex “.at” except bat. Reason: This is an alternative use of the caret ^. When it is the first character inside square brackets, it means “not.” Thus, [^b] means: match any one character that is not a b.
  • “^[hc]at” matches hat and cat but only at the beginning of a line. Reason: This is the more standard use of the caret ^. It is not inside square brackets, so it means the RegEx will match your expression only if your expression starts at the beginning of the line.
  • “[hc]at$” matches hat and cat but only at the end of a line. Reason: This is identical to the second example in this list, except for the dollar sign at the end. The dollar sign ensures that the RegEx only matches your string if your string’s characters come at the end of a line.
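If you want to try the warm-up examples yourself, here is a minimal sketch using Python's built-in re module. Python is an assumption here (Google Analytics has its own RegEx engine), and the helper name matches and the sample strings are made up for illustration.

```python
import re

words = ["hat", "cat", "bat"]

def matches(pattern, candidates):
    """Return the candidates in which the pattern is found somewhere."""
    return [w for w in candidates if re.search(pattern, w)]

dot_at = matches(r".at", words)        # the dot accepts any one character
hc_at = matches(r"[hc]at", words)      # only h or c may come before "at"
not_b_at = matches(r"[^b]at", words)   # any one character except b

# The anchors only show their effect on longer strings.
phrases = ["hat trick", "that hat"]
start_only = matches(r"^[hc]at", phrases)   # must start the line
end_only = matches(r"[hc]at$", phrases)     # must end the line

print(dot_at)      # ['hat', 'cat', 'bat']
print(hc_at)       # ['hat', 'cat']
print(not_b_at)    # ['hat', 'cat']
print(start_only)  # ['hat trick']
print(end_only)    # ['that hat']
```

re.search looks for the pattern anywhere in the string, which is why the anchors ^ and $ are needed to pin a match to the start or the end.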

OK, here is a slightly harder one, also from Wikipedia:

((great )*grand)?((fa|mo)ther)

I will take this apart to make it easier to understand.

The parentheses create groups, separated by a question mark. So we effectively have:

(expression in this set of parentheses)?(another expression in this set of parentheses)

Since a question mark usually means, include 0 or 1 of the former expression, we know that this RegEx is allowed to match just the stuff in the second set of parenthesis (right? That’s what question marks do, they can match what comes right before them or not match what comes right before them. If they don’t match the stuff before them, only the characters after them are left to match.) So, let’s start by looking at the second half only, which we know should be able to stand by itself:

((fa|mo)ther)

The pipe symbol | means OR. So this resolves to (father) OR (mother). You might reasonably ask, why do we need all the parentheses? Technically, we don’t need the outside set but they make the expression easier to read when it is all together like this: ((great )*grand)?((fa|mo)ther) It would be perfectly reasonable to write an expression like this: (fa|mo)ther. We do need the inside set because if we got rid of them, the expression would look like this: fa|mother , which means, either fa OR mother.
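To see why the inside parentheses matter, here is a small sketch in Python's re module (an assumption; it is not the engine Google Analytics uses). It uses fullmatch, which requires the whole string to match, so that the difference between the two expressions is easy to see.

```python
import re

grouped = re.compile(r"(fa|mo)ther")   # (fa OR mo), followed by ther
ungrouped = re.compile(r"fa|mother")   # fa OR mother

# fullmatch succeeds only if the entire string matches the expression.
grouped_father = grouped.fullmatch("father") is not None
grouped_mother = grouped.fullmatch("mother") is not None
ungrouped_father = ungrouped.fullmatch("father") is not None
ungrouped_fa = ungrouped.fullmatch("fa") is not None

print(grouped_father)    # True
print(grouped_mother)    # True
print(ungrouped_father)  # False: the only choices are "fa" or "mother"
print(ungrouped_fa)      # True
```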

Now let’s go back and look at the first half, the part that came before the question mark:

((great )*grand)

The star tells us to match zero, one or more than one instance of the expression before it. So it can match a string which doesn’t include great, in which case we just have grand, and of course, we always have the end of the expression, which will be either mother or father. So we might match grandmother or grandfather. It can match a string which includes great just once, in which case we have great grandmother OR great grandfather. And it can match a string which includes great more than once, so we might end up with great great great great grandmother OR great great grandfather.

So there you have it, all your ancestors with just one Regular Expression.
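Here is a quick way to check the whole expression, sketched in Python's re module (an assumption, as are the sample strings). Note that the space belongs to "great " only, so there is no space between grand and mother.

```python
import re

ancestor = re.compile(r"((great )*grand)?((fa|mo)ther)")

samples = [
    "mother",
    "grandfather",
    "great grandmother",
    "great great great great grandmother",
    "great grand mother",   # extra space before mother
    "grandbrother",         # not fa or mo
]

# fullmatch requires the entire string to fit the expression.
results = {s: ancestor.fullmatch(s) is not None for s in samples}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
# The first four print True; the last two print False.
```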

If you didn’t understand any of that, please send me email, steif -at- lunametrics.com. (I am always disappointed that no one ever comments on RegEx posts.)

Robbin


About Robbin Steif

Our owner and CEO, Robbin Steif, is an analyst herself. She is a graduate of Harvard College and the Harvard Business School, and has served on the Board of Directors for the Digital Analytics Association (formerly the Web Analytics Association.)

http://www.lunametrics.com/blog/2006/11/27/regular-expressions-part-xii-now-lets-practice/

25 Responses to “Regular Expressions Part XII: Now Let’s Practice”

Des says:

Hi Robbin,
I was looking for a “not” operator for quite a while, I had worked it out by trial and error and then found this. Can you tell me this much…
would

view\.php\?type=[^(misc)]

Match all urls of the form view.php?type=Anything, except for this one…
view.php?type=misc

This is a great series of posts by the way, thanks for it

Des

steve says:

Hi Des,

Tricky. Very tricky. In that it requires use of an advanced RegEx that most people really don’t need, use or understand. :-)

1st off: view\.php\?type=[^(misc)]
won’t do what you want. The use of the square brackets is for a grouping of characters in any order.
ie your RegEx will match anything without the letters ‘i’ or ‘s’ or ‘c’ or ‘m’ or ‘)’ or ‘(‘ as the first character after the “type=”.
eg. view.php?type=) or …type=m and so on. Being the ones that won’t match.

The Quick answer is:

view\.php\?type=(?!misc)

BUT!!!! And it’s a big but. I have no idea if GA can actually understand that construct (“negative lookahead” if you’re interested).
Suggest creating a test profile, use that as the only filter and see if you can make it trip or not by hitting your website.

In essence, this says we have a successful match if we DON’T have a ‘misc’ following the “…type=”
The use of brackets is… special/different in this situation. Be. Ware. :-)

Robbin, if you want to chase me on this, I can take you through in more detail. But, I almost *never* use this construct. It’s useful, but can be very confusing to use.
Cheers!
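
For anyone who wants to compare the two expressions in this exchange, here is a minimal sketch in Python's re module. Python is an assumption (as noted above, GA may not support lookaheads), and the sample URLs other than type=misc are made up.

```python
import re

# Square brackets: the character after "type=" must not be one of ( m i s c )
char_class = re.compile(r"view\.php\?type=[^(misc)]")
# Negative lookahead: "type=" must not be followed by the string "misc"
lookahead = re.compile(r"view\.php\?type=(?!misc)")

urls = [
    "view.php?type=misc",
    "view.php?type=news",
    "view.php?type=maps",
]

char_class_hits = [u for u in urls if char_class.search(u)]
lookahead_hits = [u for u in urls if lookahead.search(u)]

print(char_class_hits)  # ['view.php?type=news']
print(lookahead_hits)   # ['view.php?type=news', 'view.php?type=maps']
```

The square-bracket version wrongly drops type=maps because it starts with an m. The lookahead version keeps it, though it would also reject any longer value that begins with misc.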

Des says:

Hi Steve,
Thanks for your solution. I’ve tested it by using it to search content in G.A, and it works.

I had presumed that you could concatenate letters inside the square brackets to form strings. Clearly not. Here is a case where I mourn the lack of a programmer’s manual. (I’ve used reg exs before in Perl, where the documentation is a tad richer.)

Anyways, many thanks,

Des

Renato says:

My website uses PHP on a framework called Zend Framework, which makes my URL parameters look like subdirectories. My URL looks like http://www.mysite.com/section/module/page/paratemer1/value1/parameter2/value2/ instead of http://www.mysite.com/section/module/page.php?paratemer1=value1&parameter2=value2. My problem is that Urchin treats my parameters as subdirectories, and the reports are not useful because I can’t see my most accessed pages and other reports.

Do you know how I can resolve this so that Urchin considers just the first 4 directories and sums the rest? For example, in this case, I would like Urchin to count every access below the first 4 sub-directories (www.mysite.com.br/section/module/page/pagename) as a pageview of pagename.

schemogroby says:

Nice Examples. Much more practical/understandable than the official google online “explanation”. Thanks for sharing!

Robbin says:

Thanks. If you click on all the pictures of the regular expressions in the google online explanation, you’ll find that we (just) wrote the backup explanations there. So that is getting better, too. Plus, I think G is trying much much harder to do better doc work….

J.Naveen says:

Hi, this is very interesting tutorial, it is very helpful for the beginners. I saw that negative look ahead example (?!misc). I have a doubt in that example. I practiced with another example but I am not getting positive result So please help me for this expression

Ex: I have data like this….

1. tf_PADD1.setText(tf_ADD1.getText());
2. tf_PADD2.setText(tf_ADD2.setText());
3. tf_PCITY.setText(tf_CITY.getText());
4. cm_PDISTRICT.setSelectedItem(cm_DISTRICT.getSelectedItem());
5. tf_PPINCODE.setText(tf_PINCODE.setText());

In the above text I want to select the lines which do not have the combination of setText and getText (targeted result: lines 2 and 5). I wrote an expression for that, but it selects lines 1, 2, 3 and 5. At first I thought I understood this concept, but now I am confused: exactly what is the use of negative lookahead (?!….)
I thought this expression is for negating a word. So if it selects the 1st word before getText, then what is the use?

My Expression is:
(?!getText)tf_.*setText

I am learning, how to negate a word.
So please help me, I am very eager to know where I am going wrong.

Robbin says:

J. Naveen –

Why don’t you try

set(?!.*get)

That says, match the expression if it includes set, as long as it is not followed somewhere down the line by get. You can make it more specific if you need to, but that would be the guts of the negative match.
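
Here is a minimal sketch that runs both expressions over the five lines from the question, using Python's re module (an assumption; the question does not say which tool is being used).

```python
import re

lines = [
    "tf_PADD1.setText(tf_ADD1.getText());",
    "tf_PADD2.setText(tf_ADD2.setText());",
    "tf_PCITY.setText(tf_CITY.getText());",
    "cm_PDISTRICT.setSelectedItem(cm_DISTRICT.getSelectedItem());",
    "tf_PPINCODE.setText(tf_PINCODE.setText());",
]

def line_numbers(pattern):
    """Return the 1-based numbers of the lines the pattern is found in."""
    return [n for n, line in enumerate(lines, start=1) if re.search(pattern, line)]

# The lookahead here is only tested right before "tf_", where the next
# characters are never "getText", so it never rejects anything.
original = line_numbers(r"(?!getText)tf_.*setText")

# Here the lookahead is tested right after "set" and scans the rest of the line.
suggested = line_numbers(r"set(?!.*get)")

print(original)   # [1, 2, 3, 5]
print(suggested)  # [2, 5]
```

A lookahead only inspects the text immediately after the spot where it sits in the expression, which is why moving it after set changes the result.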

David says:

I was not going to leave a comment (although I was tempted) but “I am always disappointed that no one ever comments on RegEx posts.” decided me upon.

So here is my addition:

On the first example:
“.at” matches any three-character string like hat, cat or bat. Reason: Because a dot matches any character. So hat, cat and bat are all good matches, as would be any other one character match to the dot.

I would replace it with that:
“.at” matches any three-character string like hat, cat or bat. Reason: Because a dot matches any character. So hat, cat and bat are all good matches, as would be any other one character match to the dot. By the way, it also matches .at, the ccTLD for Austria—yes, a dot includes any character, including the mundane form of itself.

Robbin says:

Good point. Thanks for the comment, David.

Brandon says:

Robbin: Your solution would indeed have to be altered, considering that from what I understand he wants it to return the entire lines?

I myself for this problem just cobbled together \d\. [^.]*\.set[^.]*\.[^g][^\d]* which though rather disgusting and while it could be shortened, does work.

Brandon says:

After accidentally enabling dot equals newline I discovered why my previous pattern is so gross. Here is another quickly cobbled together one. Any shortened versions would be appreciated so I may learn. I’m still new to regex.

\d\. t.*s[^g]*

Robbin says:

Brandon – I wrote most of this over two years ago. So you have to tell me which example you are talking about.

Brandon says:

G’day, Robbin. Must say I did not expect a response after seeing the last one was from a long time ago. I was referring to J.Naveen’s problem.

Brandon says:

After accidentally enabling dot equals newline I discovered why my previous pattern is so gross. Here is another quickly cobbled together one. Any shortened versions would be appreciated so I may learn. I’m still new to regex.

\d\. t.*s[^g]*;

Missed the ;. Works now.

Chris says:

Can you use regex in setting up a funnel in GA if you have selected Head Match for the Goal itself? Or does the Goal match type have to be set to RegEx in order to use Regex in the funnel setup?

Robbin says:

The goal type has to be RegEx, and you have to use RegEx in all the steps of your funnel. You can’t mix and match. HTH

Chris says:

Awesome, thank you!

Lukas says:

Now that is really a great help. Unfortunately, I am still not able to find a way to exclude certain expressions. For example, I would like to make a filter that includes all pages with the word “telephone” in the title, but not the word “telephone booth”. How can I do that?

Would be awesome if you could help or at least say if that is possible at all.

Thank you!

Lukas

John says:

Lukas,

You used to be able to do that with Negative Lookaheads. However, Google has disabled the functionality in GA.

John

Stephanie says:

Hi John,

We just started using GA on a flash site that has many tracking events. Since GA has disabled the functionality of negative lookaheads, is there another solution? I need to include based on the term “video” but exclude several other terms in the event such as “complete”, “dropoff”, “start”, “load”, etc.

Thanks,
Stephanie

Robbin says:

Stephanie, you are right about the lookahead, but if you send me the exact example, I can try. I have one idea, but I have to play with it a lot and would love your participation. You are welcome to email me directly, to my last name at lunametrics.com

Robbin Steif

Stephanie says:

Sorry for the late reply. Below is an example of the events with two of our videos.

VIDEO+-+Understanding+Your+Communication+Potential
VIDEO+-+FY10+SLO+Highlights+Video
VIDEO+-+COMPLETEUnderstanding+Your+Communication+Potential
VIDEO+-+COMPLETEFY10+SLO+Highlights+Video
VIDEO+-+DropOff+Understanding+Your+Communication+Potential+time+252
VIDEO+-+DropOff+FY10+SLO+Highlights+Video+time+452.785

So, I need to get the total for all video views so I search on include “video\+\-\+” but I also need to exclude all the videos with “COMPLETE” and “DropOff” because those are not individual views but rather more information on the views that happened.

Thanks in advance!

Robbin says:

Stephanie, you are looking for “negative lookaheads” which Google Analytics has disabled. Since those words, complete and Dropoff, are *always* after the +-+, there might be a way to do it, but I have to work on it. Robbin

John says:

Stephanie,

Are you trying to create a Profile with just these items? Are they pageviews or are they actual “GA Events”?

If you’re trying to create a profile with just these things in it, then you just need 2 filters.

INCLUDE – Request URI – “VIDEO\+\-\+”
EXCLUDE – Request URI – “COMPLETE|DropOff”

If you want to do it in existing reports going back, you may be able to do it with a combination of Adv Segments and a Report Filter, but that is less clear, and may be less accurate, but if that is what you’re looking for, let me know.
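
The two-filter idea can be sketched outside GA with Python's re module. This is only an illustration of the logic: Python is an assumption, the matching here is case sensitive as written, and the event strings are the ones posted above.

```python
import re

events = [
    "VIDEO+-+Understanding+Your+Communication+Potential",
    "VIDEO+-+FY10+SLO+Highlights+Video",
    "VIDEO+-+COMPLETEUnderstanding+Your+Communication+Potential",
    "VIDEO+-+COMPLETEFY10+SLO+Highlights+Video",
    "VIDEO+-+DropOff+Understanding+Your+Communication+Potential+time+252",
    "VIDEO+-+DropOff+FY10+SLO+Highlights+Video+time+452.785",
]

include = re.compile(r"VIDEO\+\-\+")       # keep only video events
exclude = re.compile(r"COMPLETE|DropOff")  # then drop the follow-up events

# Apply the include filter first, then the exclude filter, with no lookahead.
views = [e for e in events if include.search(e) and not exclude.search(e)]

print(len(views))  # 2
for e in views:
    print(e)
```

Two simple expressions applied one after the other do the job that a single negative lookahead would otherwise have to do.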