Regular Expressions Part XII: Now Let’s Practice

Now that I have learned and then explained all the Regular Expressions for Google Analytics:

Backslashes \
Dots .
Carets ^
Dollar signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets [] and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

let’s work backwards, i.e. look at some expressions and figure out what they mean and why.

I always hate when techies give a really simple explanation and then jump to the hardest example possible, so I will try not to do the same (which is easy for me, not being a techie and all.) Let’s start with these warm-up examples from the Wikipedia entry on Regular Expressions.

  • “.at” matches any three-character string ending in “at”, like hat, cat or bat. Reason: A dot matches any single character. So hat, cat and bat are all good matches, as would be any other single character in place of the dot.
  • “[hc]at” matches hat and cat. Reason: Square brackets create a list of items, and you can match to any one item in the list. So this expression matches hat by pulling the “h” out of the square brackets, and it matches cat by pulling the “c” out of the square brackets, but unlike the former example, it doesn’t match bat, because there is no “b” in the square brackets.
  • “[^b]at” matches all the strings matched by the regex “.at” except bat. Reason: This is an alternative use of the caret ^. When it is the first character inside square brackets, it means “not.” Thus, [^b] means, match any one character that is not a b.
  • “^[hc]at” matches hat and cat but only at the beginning of a line. Reason: This is the more standard use of the caret ^. It is not inside square brackets, so it means the RegEx will match only if your expression starts at the beginning of the line.
  • “[hc]at$” matches hat and cat but only at the end of a line. Reason: This is identical to the second example in this list, except for the dollar sign at the end. The dollar sign ensures that the RegEx only matches your string if your string’s characters come at the end of a line.
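If you want to try the warm-up examples yourself, here is a minimal sketch in Python. Python’s re module is only my assumption for a test bench (it is not the engine Google Analytics runs), and the helper name matching_words is made up for this illustration.

```python
import re

words = ["hat", "cat", "bat"]

# The five warm-up patterns from the list above.
patterns = [r".at", r"[hc]at", r"[^b]at", r"^[hc]at", r"[hc]at$"]


def matching_words(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [w for w in candidates if re.search(pattern, w)]


results = {p: matching_words(p, words) for p in patterns}
for p, matched in results.items():
    print(p, "->", matched)

# The anchors only show their effect on longer strings.
at_start = bool(re.search(r"^[hc]at", "hat trick"))   # "hat" begins the string
not_at_start = bool(re.search(r"^[hc]at", "top hat"))  # "hat" is not at the start
at_end = bool(re.search(r"[hc]at$", "top hat"))        # "hat" ends the string
print(at_start, not_at_start, at_end)
```

On single words like these, the last two patterns behave the same as “[hc]at”; the caret and the dollar sign only make a difference once there is other text on the line.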

OK, here is a slightly harder one, also from Wikipedia:

((great )*grand)?((fa|mo)ther)

I will take this apart to make it easier to understand.

The parentheses create groups, separated by a question mark. So we effectively have:

(expression in this set of parentheses)?(another expression in this set of parentheses)

Since a question mark means, include 0 or 1 of the preceding expression, we know that this RegEx is allowed to match just the stuff in the second set of parentheses. (Right? That’s what question marks do: they can match what comes right before them or not match it. If they don’t match the stuff before them, only the characters after them are left to match.) So, let’s start by looking at the second half only, which we know should be able to stand by itself:
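Here is a small sketch of that “optional first group” idea, assuming Python’s re module as a stand-in test bench. The pattern (grand)?(mother) is a simplified illustration of my own, not the full expression.

```python
import re

# Simplified stand-in for (first group)?(second group).
pattern = re.compile(r"(grand)?(mother)")

with_prefix = pattern.fullmatch("grandmother")
without_prefix = pattern.fullmatch("mother")

# group(1) is the optional first group; group(2) is the required second group.
print(with_prefix.group(1), with_prefix.group(2))
print(without_prefix.group(1), without_prefix.group(2))
```

When the first group is skipped, Python reports it as None, while the second group still has to match.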

((fa|mo)ther)

The pipe symbol | means OR. So this resolves to (father) OR (mother). You might reasonably ask, why do we need all the parentheses? Technically, we don’t need the outside set, but it makes the expression easier to read when it is all together like this: ((great )*grand)?((fa|mo)ther). It would be perfectly reasonable to write the second half like this: (fa|mo)ther. We do need the inside set, because if we got rid of it, the expression would look like this: fa|mother, which means either fa OR mother.
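To see why the inside parentheses matter, here is a quick sketch, again assuming Python’s re module as the test bench. I use fullmatch so that the whole string has to match, which makes the difference easy to spot.

```python
import re

grouped = re.compile(r"(fa|mo)ther")   # (fa OR mo), then "ther"
ungrouped = re.compile(r"fa|mother")   # "fa" OR "mother"

grouped_father = grouped.fullmatch("father") is not None
grouped_mother = grouped.fullmatch("mother") is not None
ungrouped_father = ungrouped.fullmatch("father") is not None
ungrouped_fa = ungrouped.fullmatch("fa") is not None
ungrouped_mother = ungrouped.fullmatch("mother") is not None

print(grouped_father, grouped_mother)
print(ungrouped_father, ungrouped_fa, ungrouped_mother)
```

Without the inside parentheses, “father” as a whole word is no longer a match; only “fa” and “mother” are.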

Now let’s go back and look at the first half, the part that came before the question mark:

((great )*grand)

The star tells us to match zero, one or more than one instances of the expression before it. So it can match a string which doesn’t include great, in which case we just have grand, and of course, we always have the end of the expression, which will either be mother or father. So we might match to grandmother or grandfather. It can match a string which includes great just once, in which case we have great grandmother OR great grandfather. And it can match a string which includes great more than once, so we might end up with great great great great grandmother OR great great grandfather.

So there you have it, all your ancestors with just one Regular Expression.
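Here is the whole expression run against a few candidates, as a sketch in Python (my choice of test bench, not the Google Analytics engine). The candidate list is made up, and the last two entries are there on purpose to show what does not match.

```python
import re

ancestors = re.compile(r"((great )*grand)?((fa|mo)ther)")

candidates = [
    "mother",
    "grandfather",
    "great grandmother",
    "great great great great grandmother",
    "great mother",   # "great " is only allowed in front of "grand"
    "grand mother",   # no space is allowed between "grand" and "mother"
]

# fullmatch requires the whole candidate to match, not just a piece of it.
matched = [c for c in candidates if ancestors.fullmatch(c)]
unmatched = [c for c in candidates if not ancestors.fullmatch(c)]

print(matched)
print(unmatched)
```

Note that I used fullmatch here. With a plain search, “great mother” would still count as a hit, because the expression is happy to find just “mother” inside it.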

If you didn’t understand any of that, please send me email, steif -at- lunametrics.com. (I am always disappointed that no one ever comments on RegEx posts.)

Robbin
LunaMetrics

8 Responses to “Regular Expressions Part XII: Now Let’s Practice”

  1. Des Says:

    Hi Robbin,
    I was looking for a “not” operator for quite a while, I had worked it out by trial and error and then found this. Can you tell me this much…
    would

    view\.php\?type=[^(misc)]

    Match all urls of the form view.php?type=Anything, except for this one…
    view.php?type=misc

    This is a great series of posts by the way, thanks for it

    Des

  2. steve Says:

    Hi Des,

    Tricky. Very tricky. In that it requires use of an advanced RegEx that most people really don’t need, use or understand. :-)

    1st off: view\.php\?type=[^(misc)]
    won’t do what you want. Square brackets define a set of characters, in any order, and match exactly one character from that set (or, with a leading ^, one character that is not in it).
    i.e. your RegEx will match only when the first character after the “type=” is something other than ‘i’, ‘s’, ‘c’, ‘m’, ‘)’ or ‘(’.
    e.g. view.php?type=) or …type=m and so on are the ones that won’t match.

    The Quick answer is:

    view\.php\?type=(?!misc)

    BUT!!!! And it’s a big but. I have no idea if GA can actually understand that construct (“negative lookahead” if you’re interested).
    Suggest creating a test profile, use that as the only filter and see if you can make it trip or not by hitting your website.

    In essence, this says we have a successful match if we DON’T have a ‘misc’ following the “…type=”
    The use of brackets is… special/different in this situation. Be. Ware. :-)

    Robbin, if you want to chase me on this, I can take you through in more detail. But, I almost *never* use this construct. It’s useful, but can be very confusing to use.
    Cheers!
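Steve’s distinction between the character class and the negative lookahead can be checked with a short sketch. It assumes Python’s re module, which supports lookahead; as Steve says, whether GA accepts the construct has to be tested separately. The example URLs other than type=misc are made up.

```python
import re

char_class = re.compile(r"view\.php\?type=[^(misc)]")
lookahead = re.compile(r"view\.php\?type=(?!misc)")

urls = [
    "view.php?type=misc",
    "view.php?type=news",
    "view.php?type=sport",
    "view.php?type=chat",
]

# [^(misc)] rejects any value whose FIRST character is one of ( m i s c )
class_hits = [u for u in urls if char_class.search(u)]

# (?!misc) rejects only values that begin with the string "misc"
lookahead_hits = [u for u in urls if lookahead.search(u)]

print(class_hits)
print(lookahead_hits)
```

The character class version throws out type=sport and type=chat along with type=misc, simply because they start with an ‘s’ or a ‘c’.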

  3. Des Says:

    Hi Steve,
    Thanks for your solution. I’ve tested it by using it to search content in G.A, and it works.

    I had presumed that you can concatenate letters inside the square brackets to form strings. Clearly not. Here is a case where I mourn the lack of a programmer’s manual. (I’ve used regexes before in Perl, where the documentation is a tad richer.)

    Anyways, many thanks,

    Des

  4. Renato Says:

    My website uses PHP on a framework called Zend Framework, which makes my url parameters look like subdirectories. My url looks like http://www.mysite.com/section/module/page/paratemer1/value1/parameter2/value2/ instead of http://www.mysite.com/section/module/page.php?paratemer1=value1&parameter2=value2. My problem is that Urchin treats my parameters as subdirectories, and the reports are not useful because I can’t see my most accessed pages and other reports.

    Do you know how I can resolve this so that Urchin considers just the first 4 directories and sums the rest? For example, in this case, I would like Urchin to treat every access to anything after the first 4 sub-directories (www.mysite.com.br/section/module/page/pagename) as a pageview of pagename.

  5. schemogroby Says:

    Nice Examples. Much more practical/understandable than the official google online “explanation”. Thanks for sharing!

  6. Robbin Says:

    Thanks. If you click on all the pictures of the regular expressions in the google online explanation, you’ll find that we (just) wrote the backup explanations there. So that is getting better, too. Plus, I think G is trying much much harder to do better doc work….

  7. J.Naveen Says:

    Hi, this is a very interesting tutorial, and it is very helpful for beginners. I saw the negative lookahead example (?!misc), and I have a doubt about it. I practiced with another example but I am not getting the right result, so please help me with this expression.

    Ex: I have data like this….

    1. tf_PADD1.setText(tf_ADD1.getText());
    2. tf_PADD2.setText(tf_ADD2.setText());
    3. tf_PCITY.setText(tf_CITY.getText());
    4. cm_PDISTRICT.setSelectedItem(cm_DISTRICT.getSelectedItem());
    5. tf_PPINCODE.setText(tf_PINCODE.setText());

    In the above text I want to select the lines which do not have the combination of setText and getText (targeted result: lines 2 and 5). For that I wrote an expression, but it selects lines 1, 2, 3 and 5. At first I thought I understood this concept, but now I am confused: what exactly is the use of negative lookahead (?!….)?
    I thought this expression is for negating a word, so if it still selects the first word before getText, then what is the use?

    My Expression is:
    (?!getText)tf_.*setText

    I am learning how to negate a word.
    So please help me, I am very eager to know where I am going wrong.

  8. Robbin Says:

    J. Naveen –

    Why don’t you try

    set(?!.*get)

    That says, match the expression if it includes set, as long as it is not followed somewhere down the line by get. You can make it more specific if you need to, but that would be the guts of the negative match.
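Both expressions can be run against J. Naveen’s five lines with a short sketch, assuming Python’s re module as the test bench. The helper name hit_numbers is made up for this illustration.

```python
import re

lines = [
    "tf_PADD1.setText(tf_ADD1.getText());",
    "tf_PADD2.setText(tf_ADD2.setText());",
    "tf_PCITY.setText(tf_CITY.getText());",
    "cm_PDISTRICT.setSelectedItem(cm_DISTRICT.getSelectedItem());",
    "tf_PPINCODE.setText(tf_PINCODE.setText());",
]

original = re.compile(r"(?!getText)tf_.*setText")
suggested = re.compile(r"set(?!.*get)")


def hit_numbers(pattern):
    """Return the 1-based numbers of the lines the pattern is found in."""
    return [n for n, line in enumerate(lines, start=1) if pattern.search(line)]


original_hits = hit_numbers(original)
suggested_hits = hit_numbers(suggested)

print(original_hits)
print(suggested_hits)
```

The original expression fails because (?!getText) only looks at the characters starting right where it sits, which is immediately before tf_. The text there is always “tf_…” and never “getText”, so the lookahead never rejects anything. Putting .* inside the lookahead is what lets it scan the rest of the line.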
