Regular Expressions for GA Bonus 1: {Braces}





Before I start: the “Criticize GA Documentation” contest ends on Tuesday, June 26.

We now take a break in our regularly scheduled programming (which was filters for GA). That’s because I need to return to an old topic, Regular Expressions (RegEx) for GA, and add a much-needed post: Regular Expression Braces.

Braces are curly brackets, like this {these are braces}. The GA documentation never mentions them, so I don’t know whether they are an unsupported feature or a gap in the documentation.

Braces repeat the last “piece” of information a specific number of times. They are used with two numbers, like this: {6,8}. That particular example means, repeat the last piece of information at least six times and no more than eight.

For example, there is a place across the street here in Honolulu called the Rainbow Bazaar. If I wanted to pull a report with all the correct spellings of their name, I could search the report (in the little box at the bottom of the page in the new GA version). I would use the following RegEx:

baza{2,2}r

This means: pull all the keywords that have baz followed by at least two and no more than two a’s, followed in turn by an r. Hence, bazaar. (Notice that the letter just before the braces, the a, is my last piece of information.) Or I could use those same braces with a looser range to pull misspellings, a more interesting report.
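As a rough illustration of those two reports, here is a minimal sketch in Python. The keyword list is made up, the looser {1,3} range is just one possible choice, and Python's re module is only a stand-in for whatever RegEx engine GA uses, which may behave differently.

```python
import re

# Hypothetical keywords, invented for this example (not from a real report).
keywords = ["rainbow bazaar", "rainbow bazar", "rainbow bazaaar", "bazaar honolulu"]

exact = re.compile(r"baza{2,2}r")   # "baz", exactly two a's, then "r"
loose = re.compile(r"baza{1,3}r")   # "baz", one to three a's, then "r"

# Keywords with the correct spelling.
correct = [k for k in keywords if exact.search(k)]

# Keywords that are close to "bazaar" but not spelled correctly.
misspelled = [k for k in keywords if loose.search(k) and not exact.search(k)]

print(correct)      # ['rainbow bazaar', 'bazaar honolulu']
print(misspelled)   # ['rainbow bazar', 'rainbow bazaaar']
```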

The problem with regular expressions is always knowing what they are “working on.” In this case, what is the last piece of information? A set of square brackets or parentheses would make a piece of information. (And in fact, a great use of braces would be to capture all the IP addresses in a block of 0-255 with [0-9]{1,3}, which matches one to three digits. It’s true that you will also capture 538 and 627 and all sorts of numbers above 255, but you really don’t care, since the IP block will never go higher than 255, anyway.) In the absence of a well-defined piece of information (defined by parentheses or brackets), you are working with the last character.
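To make the “last piece of information” idea concrete, here is a small Python sketch. The helper name whole_match is made up for this example, and Python's re module is an assumption standing in for GA's engine. The range inside the braces is written with a comma, which is the form the brace syntax expects.

```python
import re

def whole_match(pattern, text):
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# No brackets or parentheses: the braces repeat only the "b".
print(whole_match(r"ab{2}", "abb"))     # True
print(whole_match(r"ab{2}", "abab"))    # False

# Parentheses make "ab" the piece that gets repeated.
print(whole_match(r"(ab){2}", "abab"))  # True

# Square brackets make "any digit" the piece: one to three digits in a row.
octet_like = r"[0-9]{1,3}"
print(whole_match(octet_like, "255"))   # True
print(whole_match(octet_like, "538"))   # True: above 255, but still captured
print(whole_match(octet_like, "1234"))  # False: four digits is too many
```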

Here are all the other RegEx posts:

Backslashes \
Dots .
Carets ^
Dollar signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets [] and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now we will Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
Minimal Matching
Lookahead

Robbin

Robbin Steif

About Robbin Steif

Our owner and CEO, Robbin Steif, is an analyst herself. She is a graduate of Harvard College and the Harvard Business School, and has served on the Board of Directors for the Digital Analytics Association (formerly the Web Analytics Association).

http://www.lunametrics.com/blog/2007/06/25/regular-expressions-for-ga-bonus-1-braces/

6 Responses to “Regular Expressions for GA Bonus 1: {Braces}”

Vinny says:

Wow, as someone new to GA, your articles on custom filters have been a real eye-opener for me.

You are probably aware of this, but I thought I would point you out to a great regular expression tool that I use in all my development efforts (I am a web application developer by trade). It’s called “The Regex Coach” and can be downloaded at http://weitz.de/regex-coach/

It allows you to enter your regular expression in the “Regular expression” pane, and then test it across different strings that you enter into the “Target string” pane.

(Note: JavaScript regular expressions don’t play nice with the usual regexp conventions. So if you have a bit of regexp working well in The Regex Coach, it will 99% likely work in your PHP/Perl/C++/etc. code, but test it thoroughly within your JavaScript code.)

Chewy says:

So this green newbie wants to know: how would I match “fish” but not “fishing”, or “box” but not “boxes”?

Robbin says:

Hi Chewy.
Try this post:
http://www.lunametrics.com/blog/2007/08/08/regular-expressions-for-ga-bonus-iii-lookahead/

…. and then tell me what you think. Look especially at the negative lookahead that Alan wrote about in the comments to that post.

Robbin
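For readers who want to see the negative lookahead idea in action, here is a minimal Python sketch. The keyword lists and the helper name hits are made up, and this thread does not confirm whether GA's own engine accepts lookahead syntax, so treat it as an illustration of the technique only.

```python
import re

# Negative lookahead: match "fish" only when it is not immediately
# followed by "ing", and "box" only when it is not followed by "es".
fish = re.compile(r"fish(?!ing)")
box = re.compile(r"box(?!es)")

def hits(pattern, keywords):
    """Return the keywords in which the pattern is found."""
    return [k for k in keywords if pattern.search(k)]

print(hits(fish, ["fish", "fishing", "fish tank"]))  # ['fish', 'fish tank']
print(hits(box, ["box", "boxes", "box set"]))        # ['box', 'box set']
```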

RegEx Jedi Master says:

@Robin

This post has errors…

* “1-3” to “1,3”
e.g. [0-9]{1-3} to [0-9]{1,3}

* .{2,2} to .{2}
e.g. baza{2,2}r to baza{2}r

* 0 to 999 would be [0-9]{1,3}
For 0 to 255 you might try [0-2][0-5]{2}, but this would not match “166”, so a better expression would be…
^([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$
http://www.google.co.uk/support/googleanalytics/bin/answer.py?answer=55572

Thanks

Phil.

Also interested to hear your thoughts on this expression…

Referral ^https?://www\.google\.(.{2,7})/([?#]hl=|search|webhp|url).*[?#&]cd=([1-9]|[1-9][0-9]).*&q=([^&]+)&(?!oi=(spell|revisions_inline|broad-revision|social_search|blog_result|microblog_result|sideways_refinements))

Good luck ;-)
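The two 0 to 255 patterns from the comment above can be checked by brute force. This is a quick Python sketch, not a statement about GA's engine. The names STRICT and SHORT are made up, and the ^ and $ anchors on the shorter pattern are added here so that both patterns are tested against the whole number.

```python
import re

# The longer pattern from the comment, copied as written.
STRICT = re.compile(r"^([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$")

# The shorter attempt; the anchors are an addition for this test.
SHORT = re.compile(r"^[0-2][0-5]{2}$")

# Try every number from 0 to 999 against both patterns.
strict_hits = [n for n in range(1000) if STRICT.match(str(n))]
short_hits = [n for n in range(1000) if SHORT.match(str(n))]

print(strict_hits == list(range(256)))  # True: exactly 0 through 255
print(166 in short_hits)                # False: the second digit 6 is outside [0-5]
print(255 in short_hits)                # True
```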

Robbin says:

Good catch! I never saw that first typo. The second one, {2,2} vs {2}, you should be able to do either way, but you are right that just a single number is more elegant.

With regard to your regex, I have never seen a referring site come through with https:// or http://, so making it mandatory with the beginning caret (^) is going to mess you up. I hope that is the advice you were looking for.

RegEx Jedi Master says:

Robin,

Thanks for looking at my RegEx example.

Although I don’t think your reply is correct… I would say that all referrals contain “https” or “http://”

For instance if you create a new profile, and then add this custom filter:
Referral:(.+)
User-defined:$A1

Wait 24 hours…
Visitors Tab > User-defined > Contains box ^https?://
…then you will see data.

Maybe you are thinking of the Source field, which does not contain the protocol, e.g. “blog.domain.com”? The referral for this domain would show “http://blog.domain.com/page.htm”.

Thanks anyway.

Phil.

Regarding “Minimal Match” (aka “Lazy Match”) rather than Greedy Match: maybe some more examples would be useful, e.g.

match first instance of “&”
\?paramId=(.+?)&

match anything before the first instance of “&”
\?paramId=([^&]+?)&
same as \?paramId=([^&]+)&

match 2 to 4 characters (more efficient method for GA import engine, but no difference in result).
\?paramId=(.{2,4}?)
Ref: http://www.regular-expressions.info/reference.html (the link includes some RegEx not supported by GA RegEx)
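Here is a short Python sketch of how greedy and lazy matching differ on patterns like the ones in this comment. The sample URL is invented, and Python's re module is an assumption standing in for GA's engine, so results in GA may differ.

```python
import re

# Hypothetical request URI, made up for this example.
url = "/page?paramId=abc123&other=xyz&last=1"

# Greedy: .+ grabs as much as it can, so it runs to the LAST "&".
greedy = re.search(r"\?paramId=(.+)&", url).group(1)

# Lazy: .+? grabs as little as it can, so it stops at the FIRST "&".
lazy = re.search(r"\?paramId=(.+?)&", url).group(1)

# Negated class: [^&]+ cannot cross an "&" at all.
negated = re.search(r"\?paramId=([^&]+)&", url).group(1)

# Lazy range with nothing after it: settles for the minimum, two characters.
bounded = re.search(r"\?paramId=(.{2,4}?)", url).group(1)

print(greedy)   # abc123&other=xyz
print(lazy)     # abc123
print(negated)  # abc123
print(bounded)  # ab
```

In this sketch a lazy quantifier only stretches past its minimum when something after it forces it to, which is why the last pattern captures just two characters.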