Archive for the ‘Regular Expressions’ Category

Filters for GA, Part I: Get Ready with Profiles and Regex

I promised to write about Google Analytics (in this video). But first, I want to talk about profiles and Regular Expressions, because they will make your work so much easier.

Profiles. So you’re learning about filters, and you’ll probably make some mistakes. Join the crowd. But why make mistakes on the data that you’ve been using for a year now? Keep that “production data” holy, and experiment on a sandbox profile. Even if you think you are an expert at GA, always have at least one sandbox profile, and preferably two.

(Need to understand what profiles are? Well, certainly, you can use a profile within an account to measure a second website. But here, we aren’t talking about profiles for a new website, we’re talking about profiles for the same website. This is one of those concepts that is hard to understand at first, but is trivial once you get it. The idea is, you have multiple copies of your web analytics, all measuring the same thing, and if you set them up exactly the same, they will look exactly the same. However, you don’t have to set them up the same — you can keep one as your “good” copy, and the others can be used to learn. Need to learn how to configure a second profile?)

Having two clean (i.e. no filters) sandbox profiles will help you in a variety of ways: First, you don’t need to worry that the other filters on that profile are messing you up somehow. Second, they both start (one with and the other without the filter) at the same time, so when you write me and ask me why your filter doesn’t work, I promise I won’t ask if you chose a time period that pulled in unfiltered data. Third, since you won’t have taken yourself out of the data (because most people use filters for that, all except those who build special cookie workarounds), you can test it yourself doing all the strange things you’d like to check out.

Regular Expressions. Most filters require regular expressions. Now that I’ve gone through fourteen posts on Regular Expressions (RegEx) for Regular People (and specifically, for GA), I will be referencing that data. And if you already know it, you’ll think that this filter stuff is easy, easy.

Robbin

Regular Expressions question and GA: Search/Replace

This weekend, someone sent me a Google Analytics Regular Expression (RegEx) question. The answer is pretty basic but interesting, and there is something to be learned about one of my favorite tools, the Epikone RegEx Tester.

Q: Hi, I’ve read most of your posts about RegEx, but I still can’t manage to find the right RegEx for one of my filters in GA.

I’d like to use a “search and replace” filter for all the pages whose URLs are either / OR /index.asp (which are in fact: www.my-domain.com and www.my-domain.com/index.asp). Basically, I’d like to have all the pages with both URLs displayed as “the page name I gave” in GA reports. [This is why he wants to use the search and replace filter: to give the pages his chosen name. Robbin]

I have tried several expressions on the RegEx filter tester but none of those seem to work. [Note to Epikone: Notice that your tool is now elevated to “the” tester of choice. Robbin]

I tried this one below, but I’m not sure that what the RegEx filter tester tells me means the filter is correct or not (I don’t fully understand how this tool works, especially for the “input string” and “result” fields). [Here is the RegEx he is interested in. Robbin]

^(/|/index\.asp)$

When I enter / in the input string, then click submit, the displayed result is Match: /,/

When I enter /index.asp in the input string, then click submit, the displayed result is Match: /index.asp,/index.asp

I don’t know what this result does mean exactly.

Could you tell me if this RegEx (^(/|/index\.asp)$) is correct regarding what I’m after, or if it’s wrong and then could you suggest me a working one ?

And here is my answer:

Robbin: Why don’t you first change your default page to be just index.asp? You can do this in settings > edit > then edit again. Telling GA that your default page is index.asp will stop you from getting a page like this: / . This will help you with the search and replace AND help you read your analytics more easily. Then you can do it the simpleton’s way: ^/index\.asp (You really don’t need the dollar sign unless you have URLs that end in aspx, for example.)

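I think if I were wanting to keep both / and /index.asp (a bad idea), my regex would be ^/(index\.asp)?$

It matches the same two URLs as yours; it just makes the index.asp part optional instead of spelling out both alternatives. (Keep the caret and the dollar sign here. Without them, a bare / would match every URL on the site.)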

The reason that the Epikone RegEx tester acts the way it does when you write it with parentheses is that parentheses tell GA, “I’ve created a variable.” Here is what Justin the Man said about their RegEx tester and creating variables; I found this in an old email from him:

Justin’s email: “Why our reg ex tester behaves the way it does. Our tester is pretty smart. If your expression matches the input string, then the tester will return the word ‘Match’ along with the part of the string that the expression matched. Now, if you are using parenthesis to store some part of the expression in a variable, the tool will return the value stored in the variable in addition to the part of the string that the reg ex matches.”

There is at least one other way to do this, too. You could go into the part of the code on the homepage that reads urchinTracker() and make it urchinTracker('homepage').

In the process of writing this, I found that there is a whole piece on the Epikone blog about how to interpret their results.

Robbin

Regular Expressions for Google Analytics: OK, I did it

I was finally inspired by Alan, who reads this blog using his Blackberry on the Paris Metro. (Truly a world wide web.) Also inspiring were the comments of an anonymous poster, who wrote that all the Regular Expressions (RegEx) posts were incredibly helpful, but couldn’t I please make them easier to access? So I did it, they are all beautifully threaded now. I even fixed awful typos in the Summary/Intro post and the ^ post (and you know, typos are the worst when you are a newbie learning a technical topic, you can spend hours trying to understand a topic, only to finally learn that the author was just lazy.)

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching
Lookahead

Robbin

Intro to GA Regular Expressions: Part XIV of XIV

This is the last of fourteen posts I have done on Regular Expressions for Google Analytics. Now that I have learned them (and hopefully explained them), it’s time to have that introductory post (I always do like to work backwards.)

Here’s the reason. People skip the introductions to books, they don’t read manuals, they just want to figure out how to make “it” work, whatever it is.

Only once it works are we ready to say, OK, so what? Why do I care, what’s it good for, what’s it bad for?

What are Regular Expressions (RegEx)? They use characters on your keyboard (like * and ^), enabling you to create an expression that may or may not match a target expression. They have a strict set of rules, just like a programming language would, and it’s easy to make mistakes with them. (This is why I am a big user of this RegEx checking tool.) So you will always have at least two expressions, the Regular Expression with the funny characters and the expression you are matching on your site, or in someone’s address, or keyword. Here’s a quick example of the Regular Expression vs. target expression issue: I can create a regular expression like this luna|robb?in and then match it against the keywords people used to come to my site to filter out all the times people used my company name or my own name, whether they spelled it right or not. In this example, the keywords were my target expressions. (Need to understand that pipe symbol in the RegEx? Need to understand the question mark in the RegEx?)
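Here is a small sketch of that Regular Expression vs. target expression idea, using Python’s re module as a stand-in for GA (an assumption). The keyword list is invented for illustration, and I have turned on case-insensitive matching, which is also my assumption.

```python
import re

# The Regular Expression: company name OR my name with one or two b's.
name_pattern = re.compile(r"luna|robb?in", re.IGNORECASE)

# The target expressions: hypothetical keywords people searched on.
keywords = [
    "lunametrics",
    "robbin steif",
    "robin steif",
    "google analytics regex",
    "Luna Metrics blog",
]

# Split the keywords into those that mention the names and those that don't.
branded = [k for k in keywords if name_pattern.search(k)]
unbranded = [k for k in keywords if not name_pattern.search(k)]

print(branded)
print(unbranded)
```

Only “google analytics regex” ends up in the unbranded list, because it contains neither luna nor robin/robbin.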

So why use them? The first reason that RegEx are worth caring about — if you use Google Analytics — is that Google cares. There are certain tasks Google just won’t let you do correctly without using RegEx. Examples that come quickly to mind are: take yourself out of the data using an IP address, creating a custom filter, creating a filter that enables you to see both your subdomain and your domain in the same profile. (The latter is just an example of the second example, a custom filter, but I mention it because you can read about it in the GA help section.)

Other great examples of custom filters: Create a filter to learn what words people actually type in to Google before they click on your AdWord, instead of just learning which AdWord gets credit. (I use this one all the time. The only hack I like better is this one.) Force all your reports to give you pages by title instead of URL.

OK, so we understand that they are needed for filters. But how about goals? After all, you can do a head match or an exact match, why go to the trouble of using a RegEx match?

One reason would be if you have two pages that are essentially the same goal or the same place in the funnel. So, for example, let’s say that when the visitor reaches either of these two pages, www.mysite.com/folder3/thanks.html and www.mysite.com/thanksalot.html, he has really achieved the same goal. By using RegEx, you can make your goal page in Google Analytics /thanks and whenever someone reaches either of those pages, the same goal (G1, or G2, or whichever one you choose) is incremented. Then if you happen to care about which page actually matters the most, you can easily go to the Content Optimization > Goals and Funnel Verification > Goal verification to see which page mattered the most.

Of course, if you have other pages on your site that match /thanks, you have to get more specific with your RegEx. However, I never forget the lesson that a friend taught me: keep your RegEx as simple as possible.
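To make that concrete, here is a rough sketch in Python (my stand-in for GA’s matching, so treat it as an assumption). The first two page paths come from the example above; the other two are hypothetical pages I added to show a miss and an accidental hit.

```python
import re

# The goal expression from the example above.
goal = re.compile(r"/thanks")

pages = [
    "/folder3/thanks.html",             # goal page from the example
    "/thanksalot.html",                 # goal page from the example
    "/contact.html",                    # hypothetical: should not count
    "/folder3/thanksgiving-sale.html",  # hypothetical: counts by accident
]

goal_hits = [p for p in pages if goal.search(p)]
print(goal_hits)
```

The Thanksgiving page sneaks in because it also contains /thanks, which is exactly the situation where the RegEx has to get more specific.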

Another reason to use RegEx is when you are lazy. After all, just because there are a ton of variations, who wants to create a ton of iterations? The example above shows how to combine two into one, but what if you had 15 variations? Example: you have to be sure that your company is not in the analytics. Your company owns all the IP addresses from 72.77.12.26 through 72.77.12.40, inclusive. You sure don’t want to create 15 filters. Instead, you can use a regular expression like this: ^72\.77\.12\.(2[6-9]|3[0-9]|40) and it will capture all fifteen IP addresses. (The little caret ^ at the beginning says, don’t match if there is something before the 72. The backslashes \ turn the special dots into plain dots. This 2[6-9] means, match 26, 27, 28 and 29. This 3[0-9] says, match 30, 31, etc. through 39. 40 says, match 40. The pipes are OR signs. So at the end of the expression, you’ll be matching to 26-29, OR 30-39, OR 40.)
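If you want to convince yourself that the expression really covers those fifteen addresses and nothing else, here is a quick check. It is only a sketch: it uses Python’s re module rather than GA itself (an assumption), and it simply tries every possible last number from 0 to 255.

```python
import re

# The IP range expression from the paragraph above.
office_ips = re.compile(r"^72\.77\.12\.(2[6-9]|3[0-9]|40)")

# Try every valid value for the last part of the address.
matched = [n for n in range(256) if office_ips.search(f"72.77.12.{n}")]

print(len(matched))              # how many addresses matched
print(matched[0], matched[-1])   # first and last matching numbers

# The caret keeps longer addresses that merely end the same way from matching.
print(bool(office_ips.search("172.77.12.30")))
```

The list comes out as 26 through 40, fifteen numbers in all, and the address starting with 172 is rejected because of the caret.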

Well, that’s it on RegEx for now. Here are all the prior posts in the series:

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

Now that I am done with this series, I’ll go back and make sure that all the posts link to all the other posts consistently. Done! Done! All done!

Robbin
LunaMetrics

Regular Expressions Part XIII: Good Greed

This is my next to last post in this Regular Expression (RegEx) series. I have been thinking about this post for a long time and yesterday someone asked me a question (which finally got me to write this). She wrote that she had two pages that she wanted to roll into one Google Analytics goal. She created the Regular Expression for it, ran it through Epikone’s RegEx Coach, and it worked — but it wasn’t working in GA. (More on the Coach below.)

The two pages were:

subdomain.mysite.com/folder/subfolder/GoalThree.php
subdomain.mysite.com/folder/subfolder/GoalThreesome.php

She sent me a long, complicated expression which wasn’t working for her and asked my opinion.

This is absolutely a case of putting Good Greed to work for you, as we will see in a minute. As I wrote in my last post, Regular Expressions are very greedy and they match everything unless you tell them not to. This is a very hard concept to wrap your head around. It means that, among other things, the expression can match in the middle of a longer string: whatever comes before it and whatever comes after it is allowed to be anything at all (unless you tell it not to, or there is nothing there to match).

Anyway, I wrote her back and said, why don’t you just write an expression like this:

/folder/subfolder/GoalThree

This assumes that she doesn’t have other GoalThreeVersions that will be incorrectly mixed in here. If, for example, she had another page, /folder/subfolder/GoalThreeCornered, that would qualify as a match too (because the RegEx matches everything it can, even if those characters aren’t in the Regular Expression.) Moving back to how simple her RegEx might be, she might even have been able to get away with a goal like this, depending on her site:

/GoalThree

This matches every expression that includes /GoalThree
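Here is a small sketch of Good Greed at work, written in Python as a stand-in for GA (an assumption). The first three paths come from the discussion above, with a .php ending added to the GoalThreeCornered page; GoalTwo.php is a hypothetical page I added so there is something that should not match.

```python
import re

# The simple goal expression suggested above.
goal = re.compile(r"/folder/subfolder/GoalThree")

urls = [
    "/folder/subfolder/GoalThree.php",
    "/folder/subfolder/GoalThreesome.php",
    "/folder/subfolder/GoalThreeCornered.php",  # matches too, wanted or not
    "/folder/subfolder/GoalTwo.php",            # hypothetical non-goal page
]

matched = [u for u in urls if goal.search(u)]
print(matched)

# The even shorter expression finds the same three pages in this list.
short_goal = re.compile(r"/GoalThree")
print(matched == [u for u in urls if short_goal.search(u)])
```

Both goal pages match without any special characters at all, and so does GoalThreeCornered, which is why the simple version only works if no such page exists on the site.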

Finally a word about the Epikone RegEx coach. I haven’t talked to Justin about this. But I am fairly sure that the coach is configured to check whether the phrase you type is a match to the RegEx you type, using the way GA interprets RegEx. That doesn’t mean that you necessarily come up with a valid goal, or an IP address that will actually filter anything. For example, you might use it to see if colou?r is a valid RegEx for color and for colour (it should be), but that doesn’t mean colou?r will necessarily work in your Google Analytics profile filters or goals. You really have to understand the context in which you are using the expression and what GA demands of you in addition to correctly configuring two expressions to match each other.

Robbin
LunaMetrics

Regular Expressions Part XII: Bad Greed

Now that I have learned and explained the Regular Expressions that Google Analytics uses:

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

I want to explore another area: Regular Expressions and the concept of greediness.

You might be tempted to write a Regular Expression like this:

/mypage/

expecting it to match the page on your site called /mypage/

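And this Regular Expression really does match /mypage/. But it also matches /mypage/thirdpage-and-something-else. For that matter, it matches /secondpage/mypage/ and /secondpage/mypage/index.html, because /mypage/ appears somewhere inside each of them. (It won’t match mypage.htm or mypage.asp, since those have no slash after mypage; the shorter expression mypage would match those too.)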

Regular Expressions are greedy — they match and match as much as they can. Greed can be good, but first I want to write about the obvious problem, i.e. the RegEx (Regular Expression) will match too many strings to be useful.

We can deal with this in various ways:

1) Tell the RegEx where to start. In the above example, if we wrote the RegEx like this

^/mypage/

it will only match when /mypage/ is at the beginning of the line, so it will never match /secondpage/mypage/ etc.

2) Tell the RegEx when to stop. We can do this in various ways, but need to know how the expression ends. For example, are we only looking for mypage.htm or are we also looking for all the pages that are in the /mypage/ folder — /mypage/otherpages.htm? If only mypage.htm matters, then we can include that in the RegEx:

/mypage\.htm$

Notice that I used a backslash to make the dot into a real dot and not a special character, and a dollar sign to say, this only works at the end of the line. (That way, mypage.html won’t match.)

We can combine that with #1 above and create this RegEx:

^/mypage\.htm$

and never get any unexpected characters before the slash or after the htm.

On the other hand, if all the pages in the /mypage/ folder that have .htm suffixes are of interest, we could do this differently:

^/mypage/.*\.htm$

I really hate when people throw this kind of gobbledygook at me so let me see if I can explain in pieces:

^/mypage/ = only consider the match if /mypage/ is at the start of a line…
.* = match everything that comes next until…
\.htm$ = you get to the last real period followed by htm and it’s at the end of a line.
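If the pieces are still muddy, it can help to run all five expressions from this post against the same handful of pages. This is only a sketch: Python’s re module stands in for GA (an assumption), and the list of page paths is made up from the examples above.

```python
import re

# Hypothetical page paths based on the examples in this post.
urls = [
    "/mypage/",
    "/mypage/otherpages.htm",
    "/secondpage/mypage/",
    "/mypage.htm",
    "/mypage.html",
    "/secondpage/mypage.htm",
]

# The expressions from this post, in the order they were introduced.
patterns = [
    r"/mypage/",           # no anchors: matches anywhere in the path
    r"^/mypage/",          # told where to start
    r"/mypage\.htm$",      # told where to stop
    r"^/mypage\.htm$",     # told both
    r"^/mypage/.*\.htm$",  # any .htm page inside the /mypage/ folder
]

results = {p: [u for u in urls if re.search(p, u)] for p in patterns}

for p, matched in results.items():
    print(f"{p:22} -> {matched}")
```

Each added anchor shrinks the list of matches, which is the whole point of telling the RegEx where to start and when to stop.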

Clear as mud, eh?

Robbin
LunaMetrics

Regular Expressions Part XII: Now Let's Practice

Now that I have learned and then explained all the Regular Expressions for Google Analytics:

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

let’s work backwards, i.e. look at some expressions and figure out what they mean and why.

I always hate when techies give a really simple explanation and then jump to the hardest example possible, so I will try not to do the same (which is easy for me, not being a techie and all.) Let’s start with these warm-up examples from the Wikipedia entry on Regular Expressions.

  • “.at” matches any three-character string like hat, cat or bat. Reason: Because a dot matches any character. So hat, cat and bat are all good matches, as would be any other one character match to the dot.
  • “[hc]at” matches hat and cat. Reason: Because square brackets create a list of items, and you can match to any one item in the list. So this expression matches hat by pulling the “h” out of the square brackets, and it matches cat by pulling the “c” out of the square brackets, but unlike the former example, it doesn’t match bat, because there is no “b” in the square brackets.
  • “[^b]at” matches all the matched strings from the regex “.at” except bat. Reason: This is an alternative use of the caret ^. When it is inside square brackets at the beginning, it means “not.” Thus, the [^b] means, don’t match a b.
  • “^[hc]at” matches hat and cat but only at the beginning of a line. Reason: This is a more standard use of the caret ^. It is not inside square brackets, so it means the RegEx will match your expression only if your expression starts at the beginning of the line.
  • “[hc]at$” matches hat and cat but only at the end of a line. Reason: This is identical to the second example in this list, except for the dollar sign at the end. The dollar sign ensures that the RegEx only matches your string if your string’s characters come at the end of a line.
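The five warm-ups above can be checked with a few lines of Python. This is a sketch using Python’s re module as a stand-in for GA (an assumption); the matching helper and the two test phrases, “hat trick” and “top hat”, are mine.

```python
import re

def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [c for c in candidates if re.search(pattern, c)]

words = ["hat", "cat", "bat"]
dot_at = matching(r".at", words)       # a dot matches any character
hc_at = matching(r"[hc]at", words)     # only h or c may come first
not_b_at = matching(r"[^b]at", words)  # anything except b may come first

# Hypothetical phrases to show the start-of-line and end-of-line anchors.
phrases = ["hat trick", "top hat"]
starts_with = matching(r"^[hc]at", phrases)
ends_with = matching(r"[hc]at$", phrases)

print(dot_at, hc_at, not_b_at)
print(starts_with, ends_with)
```

The first line prints all three words, then hat and cat twice; the second shows “hat trick” for the caret version and “top hat” for the dollar sign version.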

OK, here is a slightly harder one, also from Wikipedia:

((great )*grand)?((fa|mo)ther)

I will take this apart to make it easier to understand.

The parentheses create groups, separated by a question mark. So we effectively have:

(expression in this set of parentheses)?(another expression in this set of parentheses)

Since a question mark usually means, include 0 or 1 of the former expression, we know that this RegEx is allowed to match just the stuff in the second set of parentheses (right? That’s what question marks do, they can match what comes right before them or not match what comes right before them. If they don’t match the stuff before them, only the characters after them are left to match.) So, let’s start by looking at the second half only, which we know should be able to stand by itself:

((fa|mo)ther)

The pipe symbol | means OR. So this resolves to (father) OR (mother). You might reasonably ask, why do we need all the parentheses? Technically, we don’t need the outside set but they make the expression easier to read when it is all together like this: ((great )*grand)?((fa|mo)ther) It would be perfectly reasonable to write an expression like this: (fa|mo)ther. We do need the inside set because if we got rid of them, the expression would look like this: fa|mother , which means, either fa OR mother.

Now let’s go back and look at the first half, the part that came before the question mark:

((great )*grand)

The star tells us to match zero, one or more than one instances of the expression before it. So it can match a string which doesn’t include great, in which case we just have grand, and of course, we always have the end of the expression, which will either be mother or father. So we might match to grandmother or grandfather. It can match a string which includes great just once, in which case we have great grandmother OR great grandfather. And it can match a string which includes great more than once, so we might end up with great great great great grandmother OR great great grandfather.

So there you have it, all your ancestors with just one Regular Expression.
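For anyone who would like to try the ancestors expression, here is a minimal sketch in Python (a stand-in for GA, which is an assumption). I use fullmatch so that the whole string has to fit the expression, and the list of relatives is made up for the test.

```python
import re

# The Wikipedia expression taken apart above.
ancestors = re.compile(r"((great )*grand)?((fa|mo)ther)")

relatives = [
    "mother",
    "father",
    "grandmother",
    "great grandfather",
    "great great great great grandmother",
    "brother",             # ends in "other" but is neither mother nor father
    "great grand mother",  # extra space after grand, so not a full match
]

# Keep only the strings that the expression matches from start to finish.
full_matches = [r for r in relatives if ancestors.fullmatch(r)]
print(full_matches)
```

The first five come back and the last two do not. Note that “great grand mother” would still count as a match in an unanchored search, because the word mother on its own fits the second half of the expression.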

If you didn’t understand any of that, please send me email, steif -at- lunametrics.com. (I am always disappointed that no one ever comments on RegEx posts.)

Robbin
LunaMetrics

Regular Expressions Part XI: Real Wildcards .*

Now we are (I am) ready for a Google Analytics Regular Expression that is truly a wildcard .*

Months ago, I wrote a blog post about Regular Expressions Wildcards for Google Analytics. But when I went back to it, it was only semi-intelligible, so I deleted it and created all the Regular Expression building blocks first. If you like, you can read all of them, stretching out over a year:

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching
Lookahead

Now that you (or perhaps more correctly, I) understand the building blocks, let’s talk about how to create real wildcards.

Most of us are familiar with a star as a wildcard, outside of Regular Expressions. We can search for all our .jpg files on our computer with this: *.jpg, which to us means “get everything.jpg.” However, with Regular Expressions, a star only means repeat the last character zero times or once or more than once. In order to make it mean “get everything,” you have to pair it with a dot, like so: .*

Why? Because, a dot means get any character. A star means, repeat the last character zero times or once or more than once. So the combination means, repeat any characters as often as you like, i.e. get everything.

If we wanted to get every occurrence of a jpg file, we would do it with a RegEx that looked like this:
.*\.jpg

For those of you who are scratching your heads instead of nodding your heads, here is why: .* tells Google Analytics to match everything (as described above). The next part of the expression \. tells GA to then match a real dot. This is because dots are usually wildcards in their own right, but using a backslash turns them into ordinary dots. The last three characters, jpg, tell GA to match the letters jpg. So we end up with “everything.jpg,” which was just what we wanted.
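Here is a short sketch of the wildcard in action, again using Python’s re module as a stand-in for GA (an assumption) and a made-up list of file paths.

```python
import re

# "Everything, then a real dot, then jpg."
wildcard = re.compile(r".*\.jpg")

# Hypothetical paths for illustration.
files = [
    "/images/logo.jpg",
    "/photos/2007/cat.jpg",
    "/images/logo.png",
    "/jpg-tips.html",      # contains jpg but no ".jpg"
    "/old/photo.jpg.bak",  # contains ".jpg" in the middle
]

jpg_hits = [f for f in files if wildcard.search(f)]
print(jpg_hits)
```

The .bak file is a match as well, because nothing in the expression says that jpg has to come at the end. Adding a dollar sign (.*\.jpg$) would leave it out.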

Robbin
LunaMetrics

Many thanks to Justin and his awesome RegEx Tool (which doesn’t require a download.) Postscript: And of course, thanks to Steve, who taught me Regular Expressions from the beginning and found an error in this original post.

Regular Expressions Part X: Stars *

This is Part X of the long long series I have been doing on Regular Expressions (RegEx) for Google Analytics. It is the last one I will do that explains what Google says vs. what they mean.

When it comes to stars (or call them asterisks if you like), Google Analytics says this:

* Match zero or more of the previous items

Perfectly reasonable, if you know how to create a list of previous items. If you already read Post IX, use of the plus sign in RegEx, this will be easy, and if not, I’ll try to make it easy.

If the only special character you are using is the star *, then the previous item is defined as the previous character. For example, let’s say that my company has five digit part numbers, and I want to know how many people are searching for part number 34. The problem I have is all those leading zeros: technically, the part number is PN00034. So I could use the little Google Analytics filter box in my search report with a RegEx like this: PN0*34. That will bring me back all the searches for PN034 and PN0034 and PN00034 and PN00000034 and for that matter, PN34, since using the star means that the previous item doesn’t need to be in the search at all (zero or more of the previous items, it says).

Alternatively, we could build a list of previous items using square brackets. Like in my post on plus signs, I had a hard time finding a reason someone would want to use this, but again, used the example that Steve gave me. His example was square brackets with a space. So, I could do a search for my company name in the same filter box on the keywords report, like so: Luna[ ]*metrics. That will come back with LunaMetrics (no use of the space), or Luna Metrics (one space), or Luna  Metrics (two spaces), etc.

For the sake of completeness, I should point out that you can put real characters in the square brackets like this: b[aeiou]*d, and it matches bad and bed and bid and bod and bud. But for that matter, it matches baaaad and boud and bd, so I don’t think it is particularly useful. If I really just wanted to see those five examples (bad, bed, bid, bod and bud), I would be smarter to use the OR pipe | and do it like this: b(a|e|i|o|u)d.
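The three star examples from this post can be tried out with a short Python sketch. Python’s re module is a stand-in for GA here (an assumption), the search strings are invented, and I turned on case-insensitive matching for the company name, which is also my assumption.

```python
import re

# Star after a single character: zero or more zeros.
part_number = re.compile(r"PN0*34")
searches = ["PN34", "PN034", "PN00034", "PN00000034", "PN00035"]
part_hits = [s for s in searches if part_number.search(s)]

# Star after square brackets holding a space: zero or more spaces.
company = re.compile(r"Luna[ ]*metrics", re.IGNORECASE)
names = ["LunaMetrics", "Luna Metrics", "Luna  Metrics", "Luna-Metrics"]
company_hits = [n for n in names if company.search(n)]

# Star after a list of vowels vs. the OR pipe; fullmatch tests the whole word.
words = ["bad", "bed", "baaaad", "boud", "bd"]
star_hits = [w for w in words if re.fullmatch(r"b[aeiou]*d", w)]
pipe_hits = [w for w in words if re.fullmatch(r"b(a|e|i|o|u)d", w)]

print(part_hits)
print(company_hits)
print(star_hits, pipe_hits)
```

The star version lets baaaad, boud and bd through, while the pipe version keeps only bad and bed from this list.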

Anyone who has a great example of using a star with square brackets is strongly encouraged to comment.

Robbin
LunaMetrics

Regular Expressions Part IX: Plus signs +

This is part nine of a multi-part series I am doing to make Regular Expressions (“RegEx”) more understandable for users of Google Analytics. I am learning and teaching at the same time.

Today, I am writing about the plus sign, +. Here is how GA defines the plus sign:

+ Match one or more of the previous items

This probably seems perfectly reasonable to Google and other old hands at Regular Expressions, because they already know how to define “the Previous Items.” But I didn’t and had to figure it out.

The simplest meaning of “Previous Items” is “previous character.” So, I could look for my name in my Google Analytics search terms by typing this into the “quick filter” box: Rob+in. That will return Robin or Robbin, and for that matter, Robbbbin. It’s actually pretty useful, since I get a lot of searches like that, and often want to filter them out.

Alternatively, you can build a list of Previous Items by using square brackets. Like this: [abc]+ This will return a, ab, cab, c, b, bbbb and the like. This seems a little strange, but in fact, if you read the interpretation (“match one or more of the previous items”), you’ll notice there isn’t anything about the previous items being in a specific order.

However, I couldn’t think of any way to use square brackets with a plus sign that I couldn’t achieve with other Regular Expressions. So I wrote my tutor in Australia, Steve. The best example he had was searching for a space. Thus: if I wanted to know how many people type in web site with one space, or two spaces, or even three, I could create web[ ]+site for my Regular Expression.
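Here is a small sketch of the three plus sign examples from this post, using Python’s re module as a stand-in for GA (an assumption). The test strings are made up, and fullmatch is used for the [abc]+ case so that the whole string has to be built from a, b and c.

```python
import re

# Plus after a single character: one or more b's.
names = ["Robin", "Robbin", "Robbbbin", "Roin"]
name_hits = [n for n in names if re.search(r"Rob+in", n)]

# Plus after square brackets: one or more of a, b or c, in any order.
strings = ["a", "ab", "cab", "c", "b", "bbbb", "", "abd"]
abc_hits = [s for s in strings if re.fullmatch(r"[abc]+", s)]

# Plus after square brackets holding a space: one or more spaces.
phrases = ["web site", "web  site", "web   site", "website"]
space_hits = [p for p in phrases if re.search(r"web[ ]+site", p)]

print(name_hits)
print(abc_hits)
print(space_hits)
```

Unlike the star, the plus sign insists on at least one of the previous item, so Roin, the empty string and website (no space) are all left out.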

If you have any other great uses of square brackets and plus signs for Google Analytics, please share them.

Robbin
LunaMetrics