Archive for the ‘Regular Expressions’ Category
Posted on July 27, 2010 by Robbin Steif
A couple of months ago, I published a RegEx for Google Analytics ebook. You can download the ebook or just “page through it” online. The last page was a quiz, and I promised the answers. Here they are:
- Question 1: Write a Regular Expression that matches both dialog and dialogue.
- Answer: Since you should always keep your RegEx simple, my best answer would be dialog|dialogue. Slightly more elegant but more complicated would be dialog(ue)? and I’ll just invite others here (and for every other answer) to submit alternative ideas.
- Question 2: Write a RegEx that matches two request URLs: /secondfolder/?pid=123 and /secondfolder/?pid=567 (and cannot match other URLs)
- Answer: ^/secondfolder/\?pid=(123|567)$ — note, I chose to put a dollar sign at the end so that the regex stopped matching if anything came after the three digits at the end of the target strings.
- Question 3: Write a single Regular Expression that matches all your subdomains (and doesn’t match anything else). Make it as short as possible, since Google Analytics sometimes limits the number of characters you can use in a filter. Your subdomains are subdomain1.mysite.com, subdomain2.mysite.com, subdomain3.mysite.com, subdomain4.mysite.com, www.mysite.com, store.mysite.com and blog.mysite.com
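- Answer: There are lots of ways to do this. My choice would be: (subdomain[1-4]|www|store|blog)\.mysite\.com$ (If you need to be strict about “doesn’t match anything else,” put a caret at the front, at the cost of one more character. Without it, something like myblog.mysite.com would also match.)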
- Question 4: Write a funnel and goal that includes three steps for the funnel and the final goal step (four steps in all), using Regular Expressions. Notice that there are two ways to achieve Step 2.
- /store/
- /store/creditcard or /store/moneyorder
- /store/shipping
- /store/thankyou
Answer: This one requires a picture:
The RegEx for Step 2 is hard to read in the picture above; it is /store/(creditcard|moneyorder)$. Notice that I added a dollar sign at the end of all the expressions above. Without context, it is hard to know which ones are mandatory, but we know for sure that the dollar sign in Step 1 is mandatory. That dollar sign is required, not because of regular expressions, but because of the strange way funnels sometimes work. For more information on that one, you might want to read this excellent post on goals and funnels that Jonathan Weber of LunaMetrics wrote.
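If you want to check the quiz answers outside of Google Analytics, here is a minimal sketch. It uses Python's re module, which is my assumption and not the engine GA runs, although these particular patterns only use syntax that both understand. re.search is used because it finds a match anywhere in the string unless you anchor the pattern. The dictionary keys and the sample strings are made up for illustration.

```python
import re

# The quiz answers from this post, keyed by a made-up label.
ANSWERS = {
    "q1": r"dialog(ue)?",
    "q2": r"^/secondfolder/\?pid=(123|567)$",
    "q3": r"(subdomain[1-4]|www|store|blog)\.mysite\.com$",
    "q4_step2": r"/store/(creditcard|moneyorder)$",
}


def matches(key, text):
    """Return True if the answer pattern is found anywhere in text."""
    return re.search(ANSWERS[key], text) is not None


if __name__ == "__main__":
    samples = [
        ("q1", "dialog"),
        ("q1", "dialogue"),
        ("q2", "/secondfolder/?pid=123"),
        ("q2", "/secondfolder/?pid=1234"),   # extra digit, blocked by the $
        ("q3", "subdomain4.mysite.com"),
        ("q3", "subdomain5.mysite.com"),     # outside the [1-4] range
        ("q4_step2", "/store/moneyorder"),
        ("q4_step2", "/store/shipping"),
    ]
    for key, text in samples:
        print(key, text, matches(key, text))
```

Swapping in your own hostnames and URLs is a quick way to see which anchors you actually need before you paste an expression into a filter or goal.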
Finally, let me end with a great quote! (I didn’t ask permission to use her name, so I will just use an initial.) This reader wrote me yesterday to ask, where were the answers? Here is part of what she wrote:
Seriously, your ebook was so full of win. I’ve been trying to wrap my mind around RegEx and have been using the basics in GA for a while. But I really expanded my understanding b/c your explanations were so plain English and applicable to GA. A lot of tutorials I saw online weren’t written for GA and were written by propellerheads, I suspect.
Having worked for a graphics publishing company, I also REALLY appreciated the layout, font treatments, and graphics. A world-class job all the way around. — A
– Robbin
View Comments (12 Responses) | Categories: Regular Expressions
Posted on May 20, 2010 by Robbin Steif
Last year, when I started working on an eBook for Google Analytics and Regular Expressions, one of my acquaintances wrote, “That’s so 2008.” (And just think, now it is 2010.)
So I put it all on the shelf for a while, until Nick M and Avinash did this video and addressed RegEx (Regular Expressions) again. Hmm, I thought. Well, I use them all the time. And people write me with their RegEx and say, “Please help me troubleshoot them” all the time. And then I saw a plea for help on a bulletin board. And finally, when Nick and Avinash did that Nick-and-Avinash show referenced above, I thought, time to finish this ebook.
So here is my guide to Regular Expressions (including the cartoon characters). You can download it, or read it in html.
I know that there is one design error, but I don’t want to fix it yet again until a lot of you RegEx fans get a chance to read and comment.
All thoughts are welcome. And remember what David Meerman Scott says: On the Internet, you are what you publish.
Robbin
View Comments (8 Responses) | Categories: Regular Expressions
Posted on August 13, 2009 by Robbin Steif
This week, I am feeling the same kind of pain that those who come to our training sessions often feel. What do we do with all that data, I sometimes hear. What does it all mean?
Doing an analysis “cold.” I feel the pain because we have created a benchmark for a special set of websites, and now I have to do the evaluation. It was easy to do the overall evaluation, but when I sat down to do individual evaluations, I often found myself scratching my head. This was a particularly hard exercise, because I don’t work with these sites on a day to day basis and don’t keep up with the analytics — I approached them “cold,” you might say.
Asking the right questions. What I found was that when I happened to ask the right questions and look in the right places, I often came away with incredible nuggets of information. Golden data, you might say. But it wasn’t linear (first you look at this report, then you look at that report, etc. Nope, not like that at all.) There was a decent amount of luck involved. I was also able to bring a lot of experience from other sites that helped me ask the right questions. And I had the good fortune to be surrounded by others who knew the project and were able to make suggestions.
Patience, patience. So be patient with yourself. Maybe you can start by looking at your own benchmarking (in Google Analytics, it is at the top of the Visitors section of the navigation on the left) and ask, “Why are we doing well (or not doing well) in each of these areas?” I think web analytics includes discovery, but first, you have to get to know the territory. My current project requires me to work with the analytics cold, but many of you have the luxury of working with the same site and same analytics every day. Get to know them.
View Comments (No Responses) | Categories: Regular Expressions, Web Analytics
Posted on July 23, 2009 by Jonathan Weber
The filter at the bottom of reports can be quite handy. But you may have noticed that (like lots of AJAX-y things on the web these days) the URL doesn’t change when you apply this filter.

This can be a pain when you want to bookmark it or share the filtered report with someone else. (Sure, you can email them the report, but the static output isn’t the same as letting them play with it live.)
So you may be interested to know that you can actually apply the filter with a query parameter in the URL. (This may be of interest to anyone who’s writing Greasemonkey scripts or something like that, as well.)
The query parameter is q and you can use whatever expression you would use in the regular filter box. (Just make sure you URL-escape any characters that aren’t safe in URLs.) There’s also a query parameter qtype that specifies the value of the “containing” or “excluding” dropdown. Set qtype=0 for a “containing” filter and qtype=1 for an “excluding” filter. Just add these parameters to the URL for the report that you already have, and voila! you have a filtered report.
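Here is a minimal sketch of building such a URL, written in Python as an illustration. The parameter names q and qtype come from the paragraph above; the report URL and the function name are hypothetical placeholders, so paste in whatever address your own report shows. Encoding spaces as %20 is my choice, not something the report is known to require.

```python
from urllib.parse import quote, urlencode

# Hypothetical report URL; substitute the one from your own browser.
REPORT_URL = "https://analytics.example.com/reports/keywords?id=12345"


def filtered_report_url(report_url, expression, exclude=False):
    """Append the q and qtype filter parameters to a report URL."""
    params = urlencode(
        {"q": expression, "qtype": 1 if exclude else 0},
        quote_via=quote,  # percent-encode spaces as %20 rather than +
    )
    # Use & if the report URL already has a query string, otherwise ?
    separator = "&" if "?" in report_url else "?"
    return report_url + separator + params


if __name__ == "__main__":
    # "Containing" filter (qtype=0)
    print(filtered_report_url(REPORT_URL, "conversion|website"))
    # "Excluding" filter (qtype=1)
    print(filtered_report_url(REPORT_URL, "lunametrics", exclude=True))
```

The urlencode call takes care of the URL-escaping, so characters like the pipe or the caret in your expression survive the trip.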
View Comments (4 Responses) | Categories: Google Analytics, Regular Expressions
Posted on May 9, 2008 by Robbin Steif
Do you do this Regular Expression stuff in Google Analytics? If you don’t, I would really like your help. (Sorry, this is for newbies only.)
I worked with the documentation pros at GA to improve the Help Center documentation on Regular Expressions, but I can no longer look at it through newbie eyes. I still worry that the uninitiated will read it and think, “Huh?” but maybe I am wrong. I wrote another article, but the documentation people said, “Robbin, we think we have enough.” And maybe they are right. (That’s the part that I don’t know.)
So please go to this page in the GA Help Section (if you are new at it), and send me email, telling me how new you are to this, and whether it made enough sense for you to be able to get started with Regular Expressions. Send me email so that others won’t be influenced by your thoughts. Remember, the less you know about this, the more valuable your thoughts are.
View Comments (7 Responses) | Categories: Regular Expressions
Posted on January 16, 2008 by John Henson
The Problem
Do you sometimes want a second report filter at the bottom of your keyword reports?

I do. Every time I want to see phrases containing a particular word, but not containing branded keywords. For example, I might want to see all searches containing “conversion”, but not “lunametrics conversion.” It is a recurring theme for me.
The Solution
So what do you do? It’s called a Lookahead, and it comes in two versions. Positive Lookahead and Negative Lookahead. And we can chain together as many as we want.
Let’s say I want to find all keyword phrases that contain conversion and contain website but do not contain lunametrics. Here is what I’d type into the report filter:
^(?=.*conversion)(?=.*website)(?!.*lunametrics).*$
The (?! begins a negative lookahead (must not match) and a (?= begins a positive lookahead (must match). The regular expressions inside the lookaheads can be as complex as you want. But if you just want to follow the formula I used here. . .
The Recipe
1. Start with ^
2. Place each word you don’t want inside a Negative Lookahead : (?!.*word)
3. Place each word you do want inside a Positive Lookahead: (?=.*word)
4. Chain together as many of each as you want
5. Finish up with .*$
Caveat
This example will match anything with “website” in it anywhere. If you want to match exactly “website” and not “websites” or “123website”, use (?=.*\bwebsite\b) instead.
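If you build these filters often, the recipe is easy to script. The sketch below is in Python, which is my assumption (its re module supports the same lookahead syntax, but it is not the engine GA uses). The function name keyword_filter is made up for illustration.

```python
import re


def keyword_filter(include=(), exclude=()):
    """Build a lookahead pattern by following the recipe step by step."""
    # Step 2: each unwanted word goes inside a negative lookahead.
    negative = "".join(f"(?!.*{re.escape(word)})" for word in exclude)
    # Step 3: each wanted word goes inside a positive lookahead.
    positive = "".join(f"(?=.*{re.escape(word)})" for word in include)
    # Steps 1, 4 and 5: start with ^, chain them, finish with .*$
    return f"^{negative}{positive}.*$"


if __name__ == "__main__":
    pattern = keyword_filter(
        include=["conversion", "website"],
        exclude=["lunametrics"],
    )
    print(pattern)
    for phrase in [
        "website conversion tips",
        "lunametrics website conversion",
        "conversion rate",
    ]:
        print(phrase, "->", bool(re.search(pattern, phrase)))
```

The order of the lookaheads does not matter, since every one of them is checked from the start of the string. Paste the printed pattern into the report filter box.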
View Comments (13 Responses) | Categories: Regular Expressions
Posted on January 2, 2008 by John Henson
Do long search phrases convert better?
This was what I wanted to find out for a particular client, but it took some work. I used a regular expression in the Keywords Report of Google Analytics to filter by the number of terms in the Keyword Phrase. The exported results showed a clear increase in conversion rate as the number of search terms increased.
This client was doing far better with searchers who were using a lot of terms. They were being specific! They knew just what they were looking for and were ready to buy. This data put additional power behind recommendations concerning content, search engine optimization and paid search strategies.

Search terms   Conversion rate
1              0.59%
2              0.60%
3              0.90%
4              1.17%
5              1.06%
6              1.22%
7              1.88%
8              3.33%
Even though there were a lot of people using long search phrases, this data was obscured. As the number of terms increased, the number of people searching for exactly that phrase decreased. This resulted in none of the individual phrases seeming to count for much. The so-called Long Tail.
You really have to dig to find these sorts of gems but they are invaluable in the pursuit of providing information that can be acted upon.
A tool for digging
The tool is a Regular Expression, a pattern matching language. If you’re not already familiar with it, there is a great series of articles right here on the LunaMetrics blog.
Here is what I used:
^([\+*"*\s*,*'*\-*]*\w+\b\s*[\+*"*\s*,*'*\-*]*){3}$
It accounts for the most common characters I’ve found between words.
Steve (see comments) pointed out a great way to shorten my expression by using the \W character set. Here is what it looks like.
^(\W*\w+\b\W*){3}$
\W is shorthand for all non-word characters
How do I use it?

I know this may look like gibberish but keep reading — you don’t need to understand it to get some use from it.
In Google Analytics, go to Traffic Sources > Keywords and paste the Regular Expression into the box at the bottom of the data. Just change the {3} to whatever number of terms you want to see and click the GO button.
A brief look at the RegEx
Although this is not strictly a Regular Expression post, I feel obligated to include a basic glance at the different parts of the expression. Feel free to skip this if you just don’t care.
^ anchors the beginning of the match to the beginning of the string
( ) used to group a set of items together for a match
[\+*"*\s*,*'*\-*]* This character set matches any number and any order of + " , - ' and whitespace (\s). It is what handles all the characters that might end up separating different search terms.
\w+ Matches 1 or more alphanumeric characters (the \w is another pre-defined set of characters like \s)
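\b Match for a word boundary. It forces each run of \w characters to be a whole word, separated from the next one by something. Otherwise the expression would also match any single run of three or more word characters, since each character could count as its own “word.”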
{3} Requires exactly 3 of the above sequence so it would match the phrase one two three but not one two three four
$ anchors the end of the match to the end of the string
Don’t Sweat the Small Stuff
You can’t account for every situation. For example, sometimes ‘ is meant as an apostrophe and sometimes “-” is used as a hyphen. In the end the impact is usually small – just 2-3% of the search phrases were affected in my case and they just get bumped to the next higher match instead. (For example, non-glare window would match at {3} instead of {2})
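For anyone who prefers to do the counting on exported keyword data, here is a minimal sketch in Python (my assumption; GA itself only needs the expression). It wraps the shorter \W version of the expression so you can change the number of terms, and the function names are invented for this example.

```python
import re


def term_count_pattern(n):
    """Return the expression with {3} swapped for the number of terms you want."""
    return r"^(\W*\w+\b\W*){%d}$" % n


def count_terms(phrase, max_terms=20):
    """Return the number of terms the expression sees in phrase, or None."""
    for n in range(1, max_terms + 1):
        if re.search(term_count_pattern(n), phrase):
            return n
    return None


if __name__ == "__main__":
    for phrase in [
        "one two three",
        "one two three four",
        '"cheap" flights, hawaii',
        "non-glare window",  # the hyphen bumps this from 2 terms to 3
    ]:
        print(phrase, "->", count_terms(phrase))
```

The last sample shows the small-stuff problem from above: the hyphen splits non-glare into two terms.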
It is an interesting way to look at keyword data and maybe you’ll get some use from it– if you do, let me know.
View Comments (18 Responses) | Categories: Conversion Science, Google Analytics, Regular Expressions
Posted on August 8, 2007 by Robbin Steif
I hope this will be the last installment, for some time to come, of my Regular Expressions (RegEx) for Google Analytics series. At the end of the post, I have finally threaded all the RegEx posts.
Tonight you can see some very cool (if you care) regular expressions for GA that look ahead and decide whether the match is allowed. Sort of a conditional match. There are two kinds of lookaheads: negative lookahead (don’t match if …) and positive lookahead (only match if …). Like braces, this is a RegEx that works in Google Analytics, but there is no documentation on it.
Here’s an example that I had today. I was working with a GA site where they have membership, not customers. The site owners use the word members in some of their URIs, and they also use membermail and memberthis and memberthat and membertheother. And a whole lot of other memberexamples.
So let’s say that I needed to create a filter with a regular expression to include all those member URIs in GA, but I want to make sure that I don’t include membermail, since mail is in a special category in my marketing mix. Now we’re in a position to formulate the question really well: how do we include all the Request URIs that have the string member in them where that string is not followed by the string mail? In other words — don’t match member if it is followed by mail.
Steve gets the credit for this one. He suggested that GA might work with negative lookahead – the ability to combine regular expressions to say, “don’t match if it is followed by…” In our membermail example, the expression would be member(?!mail).
The opening of the parenthesis, followed by a question mark, tells the RegEx engine, “Watch out, lookahead coming.” The exclamation point says, “And it’s a negative.” Combined, they mean, it’s a negative lookahead – don’t match the first part of the string if the second part is there. Don’t match member if it is followed by mail.
GA also handles positive lookahead. So if we want to match only membermail and not memberthis or memberthat and all the other uris with member, we can write our RegEx like this: member(?=mail) – in this case the open paren and the question mark do the same thing (“Watch out, lookahead coming,”) but the equal sign says, “And it’s a positive match.”
There is one last little fine point to wrap your head around, assuming you are not dizzy already. The lookahead string, or whatever you want to call the string mail in my example, is not part of the match. I know this sounds like gibberish, so let me give a last example. This one’s for all you RegEx fans. And for everyone on the Paris metro:
Example: Let’s say that I am doing a positive lookahead like this: member(?=s)hip. This means, match to member only if it is followed by an s, and then please match to the hip, too. However, the string membership would not be a match. That seems a little ridiculous. After all, it is member, followed by an s, followed by hip, right?
Well, it doesn’t work that way. That’s because the s is only a condition. In the eyes of the RegEx engine, you sort of have a conditional regex that looks like this:
memberhip (notice, no s)
And, you are trying to match to
membership
It’s not a match, because the s isn’t part of the RegEx. It was just being used as a condition.
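You can watch the “condition only” behavior outside of GA with a few lines of Python. This is a sketch under the assumption that Python's re module treats lookaheads the same way GA does for these simple cases; the sample URIs and the names in the code are made up.

```python
import re

NOT_MAIL = re.compile(r"member(?!mail)")        # negative lookahead
ONLY_MAIL = re.compile(r"member(?=mail)")       # positive lookahead
CONDITION_ONLY = re.compile(r"member(?=s)hip")  # the s is never consumed


def matched_text(pattern, uri):
    """Return the text the pattern actually matched, or None."""
    found = pattern.search(uri)
    return found.group(0) if found else None


if __name__ == "__main__":
    print(matched_text(NOT_MAIL, "/memberthis/"))       # member
    print(matched_text(NOT_MAIL, "/membermail/"))       # None
    print(matched_text(ONLY_MAIL, "/membermail/"))      # member
    print(matched_text(ONLY_MAIL, "/memberthat/"))      # None
    print(matched_text(CONDITION_ONLY, "membership"))   # None
    # Consuming the s after checking for it makes the match work.
    print(matched_text(re.compile(r"member(?=s)ship"), "membership"))
```

Notice that even when the lookahead succeeds, the matched text is only member. The mail is checked but never becomes part of the match.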
OK, we are done for tonight, and
Late note: A reader here, Alan, wrote a wonderful comment, whereby he shows other ways to implement and use the power of positive lookahead. You should read his comment, but the short version is, you can use positive lookahead to match if the “condition” is somewhere down the line. The “lookahead string” (the condition) doesn’t have to *immediately* follow the (?= if you use the syntax that he figured out. (See, now you have to read his stuff….)
Here is the thread with all the RegEx posts.
Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now we will Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching
– Robbin
View Comments (12 Responses) | Categories: Regular Expressions
Posted on July 29, 2007 by Robbin Steif
Here it is … a straggling installment on my year-long saga to bring Regular Expressions (RegEx) to regular people. Today, I want to write about how to make your expressions less greedy. It’s a feature called “minimal matching.” Unfortunately for you advanced learners, you have to wait until the end of the article to see the application. I always want to teach with the easiest of examples (however meaningless they may be…)
Last winter, I wrote about how RegEx are greedy. But Google Analytics uses a flavor of RegEx that allows you to make them less greedy.
Let’s start with a fairly stupid but easy to understand example. We’ll assume that the phrase we want to match to is baaaaa. We could create a regular expression like this:
ba+
That means, start matching at the b, and then match one or more a’s. (Right? The plus sign means, match one or more of the previous expression, which is so often, just the previous character.) So this RegEx that I wrote, ba+, will match the entire string:
baaaaa
(If that doesn’t make sense to you, remember that RegEx are greedy, and the default is, they match as much as they possibly can. Here, they will match all the a’s we give it, since the plus sign means, one or more.)
But what if we wanted to match just the first b and the first a, like this: ba
We need to specify minimal matching (and hang in there with me, soon it will all be worthwhile.) We can match to just the first instance of a by using a question mark. Like this:
ba+?
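When Justin taught this to me (months ago), I asked, “How does the RegEx know that the question mark is doing minimal matching, and not there to say, match zero or one of the last character?” Well, it turns out that it depends on where the question mark sits. When it comes right after another repetition symbol (a plus sign, a star or braces), it doesn’t mean “zero or one.” Instead, it tells that symbol to match as little as it can. So the question mark in the expression above tells the plus sign, “Match one a, and don’t take any more unless the rest of the expression needs them.”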
This concept is actually very helpful when you are matching to URI strings. For example, let’s say you had a string like this:
gibberish/foldertwo/folderthree
And you wanted to rewrite that gibberish to be something meaningful. But because it is gibberish, it looks different every time. You could write a RegEx like this:
.*/
But, it will match to the last slash because the RegEx are greedy. However, if you use minimal matching, like this:
.*?/
it will match to only the first slash. And that’s what minimal matching is all about.
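Here is a minimal sketch that shows greedy and minimal matching side by side. It is written in Python, which is my assumption; its re module supports the same ? modifier, but it is not GA's own engine. The replacement folder name catalog and the helper name first_match are invented for the example.

```python
import re

URI = "gibberish/foldertwo/folderthree"


def first_match(pattern, text):
    """Return the text matched by pattern, or None if there is no match."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


if __name__ == "__main__":
    print(first_match(r"ba+", "baaaaa"))     # greedy: baaaaa
    print(first_match(r"ba+?", "baaaaa"))    # minimal: ba
    # Minimal matching still takes more when the rest of the pattern needs it.
    print(first_match(r"ba+?r", "baaaaar"))  # baaaaar
    print(first_match(r".*/", URI))          # greedy: gibberish/foldertwo/
    print(first_match(r".*?/", URI))         # minimal: gibberish/
    # Rewrite only the gibberish folder, leaving the rest of the URI alone.
    print(re.sub(r"^.*?/", "catalog/", URI))
```

The last line is the practical payoff: the minimal version replaces just the first folder, where the greedy version would have swallowed foldertwo as well.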
Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now we will Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching
Lookahead
Robbin
View Comments (12 Responses) | Categories: Google Analytics, Regular Expressions
Posted on June 25, 2007 by Robbin Steif
Before I start – The “Criticize GA Documentation” contest ends on Tuesday, June 26.
We now take a break in our regularly-scheduled programming (which was filters for GA). That’s because I need to return to an old topic, Regular Expressions (RegEx) for GA, and add a much-needed post: Regular Expression Braces.
Braces are curly brackets, like this {these are braces}. GA never mentions them. So, I don’t know if they are an unsupported feature, or a problem with the documentation.
Braces repeat the last “piece” of information a specific number of times. They are used with two numbers, like this: {6,8}. That particular example means, repeat the last piece of information at least six times and no more than eight.
For example, there is a place across the street here in Honolulu called the Rainbow Bazaar. If I wanted to pull a report with all the correct spellings of their name, I could search the report (in the little box at the bottom of the page in the new GA version). I would use the following RegEx:
baza{2,2}r
This means, pull all the keywords that have a baz followed by at least two and no more than two a’s and which are also followed by an r. Hence, bazaar. (Notice that the last letter is my last piece of information.) Or I could use those same braces to pull misspellings, a more interesting report.
The problem with regular expressions is always knowing what they are “working on.” In this case, what is the last piece of information? A set of square brackets or parentheses would make a piece of information. (And in fact, a great use of braces would be to capture all the IP addresses in a block of 0-255, like this: [0-9]{1,3}. It’s true that you will also capture 538 and 627 and all sorts of numbers above 255, but you really don’t care, since the IP block will never go higher than 255, anyway.) In the absence of a well-defined piece of information (defined by parentheses or brackets), you are working with the last character.
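To see what the braces are “working on,” here is a minimal sketch in Python (my assumption; Python's re module handles braces the same way for these cases, but it is not GA's engine). The misspelling pattern baza{1,3}r and the 192.168.1 address block are made-up examples.

```python
import re

CORRECT = r"baza{2,2}r"      # exactly two a's: the braces apply to the last a only
MISSPELLED = r"baza{1,3}r"   # hypothetical: one to three a's, to catch typos
IP_BLOCK = r"^192\.168\.1\.[0-9]{1,3}$"  # hypothetical block; braces apply to [0-9]


def found(pattern, text):
    """Return True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    for keyword in ["rainbow bazaar", "rainbow bazar", "rainbow bazaaar"]:
        print(keyword, found(CORRECT, keyword), found(MISSPELLED, keyword))
    for address in ["192.168.1.7", "192.168.1.255", "192.168.1.538", "192.168.1.2550"]:
        print(address, found(IP_BLOCK, address))
```

As the post says, 538 slips through the IP pattern, which is harmless because no real address in the block can have that value.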
Here are all the other RegEx posts:
Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now we will Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
Minimal Matching
Lookahead
Robbin
View Comments (6 Responses) | Categories: Google Analytics, Regular Expressions, Web Analytics