Archive for the ‘Regular Expressions’ Category

Wanted: GA Analysts who don’t do Regular Expressions

Friday, May 9th, 2008

Do you do this Regular Expression stuff in Google Analytics? If you don’t, I would really like your help. (Sorry, this is for newbies only.)

I worked with the documentation pros at GA to improve the Help Center documentation on Regular Expressions, but I can no longer look at it through newbie eyes. I still worry that the uninitiated will read it and think, “Huh?” but maybe I am wrong. I wrote another article, but the documentation people said, “Robbin, we think we have enough.” And maybe they are right. (That’s the part that I don’t know.)

So please go to this page in the GA Help Section (if you are new at it), and send me email, telling me how new you are to this, and whether it made enough sense for you to be able to get started with Regular Expressions. Send me email so that others won’t be influenced by your thoughts. Remember, the less you know about this, the more valuable your thoughts are.

Getting More than One Requirement out of your GA Report Filter

Wednesday, January 16th, 2008

The Problem

Do you sometimes want a second report filter at the bottom of your keyword reports?

I do. Every time I want to see phrases containing a particular word, but not containing branded keywords. For example, I might want to see all searches containing “conversion”, but not “lunametrics conversion.” It is a recurring theme for me.

 

The Solution

So what do you do? It’s called a Lookahead, and it comes in two versions. Positive Lookahead and Negative Lookahead. And we can chain together as many as we want.

Let’s say I want to find all keyword phrases that contain conversion and contain website but do not contain lunametrics. Here is what I’d type into the report filter:

^(?=.*conversion)(?=.*website)(?!.*lunametrics).*$

The (?! begins a negative lookahead (must not match) and a (?= begins a positive lookahead (must match). The regular expressions inside the lookaheads can be as complex as you want. But if you just want to follow the formula I used here. . .

 

The Recipe

1. Start with ^
2. Place each word you don’t want inside a Negative Lookahead : (?!.*word)
3. Place each word you do want inside a Positive Lookahead: (?=.*word)
4. Chain together as many of each as you want
5. Finish up with .*$

Caveat

This example will match anything with “website” in it anywhere. If you want to match exactly “website” and not “websites” or “123website”, use (?=.*\bwebsite\b) instead.
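If you want to try the recipe before pasting it into a report filter, here is a minimal sketch in Python. It assumes Python's re module treats lookaheads the same way the GA filter box does, and the keyword phrases are invented for illustration.

```python
import re

# The filter from the post, typed without spaces.
pattern = re.compile(r"^(?=.*conversion)(?=.*website)(?!.*lunametrics).*$")
# Stricter variant from the caveat: "website" must stand alone as a word.
strict = re.compile(r"^(?=.*conversion)(?=.*\bwebsite\b)(?!.*lunametrics).*$")

# Hypothetical keyword phrases, made up for this example.
keywords = [
    "website conversion tips",
    "lunametrics website conversion",
    "conversion rate",
    "improve websites conversion",
]

loose_hits = [k for k in keywords if pattern.search(k)]
strict_hits = [k for k in keywords if strict.search(k)]

print(loose_hits)
print(strict_hits)
```

The branded phrase and the phrase without "website" drop out of both lists, and "websites" only survives the looser version.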

Keyword Analysis by Number of Terms (and the RegEx that helps)

Wednesday, January 2nd, 2008

Do long search phrases convert better?

This was what I wanted to find out for a particular client, but it took some work. I used a regular expression in the Keywords Report of Google Analytics to filter by the number of terms in the Keyword Phrase. The exported results showed a clear increase in conversion rate as the number of search terms increased.

This client was doing far better with searchers who were using a lot of terms. They were being specific! They knew just what they were looking for and were ready to buy. This data put additional power behind recommendations concerning content, search engine optimization and paid search strategies.

Number of terms    Conversion rate
1                  0.59%
2                  0.60%
3                  0.90%
4                  1.17%
5                  1.06%
6                  1.22%
7                  1.88%
8                  3.33%

Even though there were a lot of people using long search phrases, this data was obscured. As the number of terms increased, the number of people searching for exactly that phrase decreased. This resulted in none of the individual phrases seeming to count for much. The so-called Long Tail.

You really have to dig to find these sorts of gems but they are invaluable in the pursuit of providing information that can be acted upon.

A tool for digging

The tool is a Regular Expression, a pattern matching language. If you’re not already familiar with it, there is a great series of articles right here on the LunaMetrics blog.

Here is what I used:

^([\+*"*\s*,*'*\-*]*\w+\b\s*[\+*"*\s*,*'*\-*]*){3}$

It accounts for the most common characters I’ve found between words.

Steve (see comments) pointed out a great way to shorten my expression by using the \W character set. Here is what it looks like.

^(\W*\w+\b\W*){3}$

\W is shorthand for all non-word characters

How do I use it?

I know this may look like gibberish but keep reading — you don’t need to understand it to get some use from it.

In Google Analytics, go to Traffic Sources > Keywords and paste the Regular Expression into the box at the bottom of the data. Just change the {3} to whatever number of terms you want to see and click the GO button.

A brief look at the RegEx

Although this is not strictly a Regular Expression post, I feel obligated to include a basic glance at the different parts of the expression. Feel free to skip this if you just don’t care.

^ anchors the beginning of the match to the beginning of the string

( ) used to group a set of items together for a match

[\+*"*\s*,*'*\-*]* This group matches any number, in any order, of + " , - ' and whitespace (\s). It is what handles all the characters that might end up separating different search terms.

\w+ Matches 1 or more word characters: letters, digits, or the underscore (the \w is another pre-defined set of characters like \s)

\b Match for a word boundary. It forces the \w characters to be separated by something. Otherwise the expression will match any string of characters longer than {3}.

{3} Requires exactly 3 of the above sequence so it would match the phrase one two three but not one two three four

$ anchors the end of the match to the end of the string
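To see the pieces working together, here is a rough sketch in Python using the shortened \W version of the expression. It assumes Python's re module behaves like the GA filter box for this pattern, and term_count_pattern is a helper name invented for the example.

```python
import re

def term_count_pattern(n):
    """Build the shortened expression for phrases with exactly n terms."""
    return re.compile(r"^(\W*\w+\b\W*){%d}$" % n)

two = term_count_pattern(2)
three = term_count_pattern(3)

# Exactly three terms matches at {3}; four terms does not.
three_terms = bool(three.search("one two three"))
four_terms = bool(three.search("one two three four"))

# A hyphenated phrase is counted as three terms, not two.
hyphen_at_two = bool(two.search("non-glare window"))
hyphen_at_three = bool(three.search("non-glare window"))

print(three_terms, four_terms, hyphen_at_two, hyphen_at_three)
```

Changing the number passed to the helper is the same as changing the {3} in the report filter.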

 

Don’t Sweat the Small Stuff

You can’t account for every situation. For example, sometimes ‘ is meant as an apostrophe and sometimes “-” is used as a hyphen. In the end the impact is usually small – just 2-3% of the search phrases were affected in my case and they just get bumped to the next higher match instead. (For example, non-glare window would match at {3} instead of {2})

It is an interesting way to look at keyword data and maybe you’ll get some use from it– if you do, let me know.

 

Regular Expressions for GA, Bonus III: Lookahead

Wednesday, August 8th, 2007

I hope this will be the last installment, for some time to come, of my Regular Expressions (RegEx) for Google Analytics series. At the end of the post, I have finally threaded all the RegEx posts.

Tonight you can see some very cool (if you care) regular expressions for GA that look ahead and decide whether the match is allowed. Sort of a conditional match. There are two kinds of lookaheads: negative lookahead (don’t match if …) and positive lookahead (only match if …). Like braces, this is a RegEx feature that works in Google Analytics, but there is no documentation on it.

Here’s an example that I had today. I was working with a GA site where they have membership, not customers. The site owners use the word members in some of their URIs, and they also use membermail and memberthis and memberthat and membertheother. And a whole lot of other memberexamples.

So let’s say that I needed to create a filter with a regular expression to include all those member URIs in GA, but I want to make sure that I don’t include membermail, since mail is in a special category in my marketing mix. Now we’re in a position to formulate the question really well: how do we include all the Request URIs that have the string member in them where that string is not followed by the string mail? In other words — don’t match member if it is followed by mail.

Steve gets the credit for this one. He suggested that GA might work with negative lookahead - the ability to combine regular expressions to say, “don’t match if it is followed by…” In our membermail example, the expression would be member(?!mail) .

The opening of the parenthesis, followed by a question mark, tells the RegEx engine, “Watch out, lookahead coming.” The exclamation point says, “And it’s a negative.” Combined, they mean, it’s a negative lookahead - don’t match the first part of the string if the second part is there. Don’t match member if it is followed by mail .

GA also handles positive lookahead. So if we want to match only membermail and not memberthis or memberthat and all the other URIs with member, we can write our RegEx like this: member(?=mail) – in this case the open paren and the question mark do the same thing (“Watch out, lookahead coming,”) but the equal sign says, “And it’s a positive match.”

There is one last little fine point to wrap your head around, assuming you are not dizzy already. The lookahead string, or whatever you want to call the string mail in my example, is not part of the match. I know this sounds like gibberish, so let me give a last example. This one’s for all you RegEx fans. And for everyone on the Paris metro:

Example: Let’s say that I am doing a positive lookahead like this: member(?=s)hip. This means, match to member only if it is followed by an s, and then please match to the hip, too. However, the string membership would not be a match. That seems a little ridiculous. After all, it is member, followed by an s, followed by hip, right?

Well, it doesn’t work that way. That’s because the s is only a condition. In the eyes of the RegEx engine, you sort of have a conditional regex that looks like this:

memberhip (notice, no s)

And, you are trying to match to

membership

It’s not a match, because the s isn’t part of the RegEx. It was just being used as a condition.
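Here is a small sketch in Python of the three cases above. It assumes Python's re module handles lookahead the way GA does, and the request URIs are made up for illustration.

```python
import re

# Hypothetical request URIs, invented for this example.
uris = ["/members/", "/membermail/inbox", "/memberthis", "/about"]

# Negative lookahead: member, but not when mail comes right after it.
not_mail = [u for u in uris if re.search(r"member(?!mail)", u)]

# Positive lookahead: member, but only when mail comes right after it.
only_mail = [u for u in uris if re.search(r"member(?=mail)", u)]

# The lookahead is only a condition: the matched text is "member", without "mail".
matched_text = re.search(r"member(?=mail)", "/membermail/inbox").group(0)

# member(?=s)hip needs an "s" and an "h" in the same position, so it cannot match.
membership = re.search(r"member(?=s)hip", "membership")

print(not_mail, only_mail, matched_text, membership)
```

If you do want membership to match while still using the condition, the s has to appear in the expression itself as well, as in member(?=s)ship.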

OK, we are done for tonight, and

Late note: A reader here, Alan, wrote a wonderful comment, whereby he shows other ways to implement and use the power of positive lookahead. You should read his comment, but the short version is, you can use positive lookahead to match if the “condition” is somewhere down the line. The “lookahead string” — the condition — doesn’t have to *immediately* follow the (?!) if you use the syntax that he figured out. (See, now you have to read his stuff….)

Here is the thread with all the RegEx posts.

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now we will Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

– Robbin

Regular Expressions for GA, Bonus II: Minimal Matching

Sunday, July 29th, 2007

Here it is … a straggling installment on my year-long saga to bring Regular Expressions (RegEx) to regular people. Today, I want to write about how to make your expressions less greedy. It’s a feature called “minimal matching.” Unfortunately for you advanced learners, you have to wait until the end of the article until you see the application. I always want to teach with the easiest of examples (however meaningless they may be…)

Last winter, I wrote about how RegEx are greedy. But Google Analytics uses a flavor of RegEx that allows you to make them less greedy.

Let’s start with a fairly stupid but easy to understand example. We’ll assume that the phrase we want to match to is baaaaa. We could create a regular expression like this:

ba+

That means, start matching at the b, and then match one or more a’s. (Right? The plus sign means, match one or more of the previous expression, which is so often, just the previous character.) So this RegEx that I wrote, ba+, will match the entire string:

baaaaa

(If that doesn’t make sense to you, remember that RegEx are greedy, and the default is, they match as much as they possibly can. Here, they will match all the a’s we give it, since the plus sign means, one or more.)

But what if we wanted to match just the first b and the first a, like this: ba

We need to specify minimal matching (and hang in there with me, soon it will all be worthwhile.) We can match to just the first instance of a by using a question mark. Like this:

ba+?

When Justin taught this to me (months ago), I asked, “How does the RegEx know that the question mark is doing minimal matching, and not there to say, match zero or one of the last character?” Well, it turns out that position decides: a question mark that comes right after another quantifier (like + or *) switches that quantifier to minimal matching. So the question mark in the expression above tells the plus sign, “Match one a, and no more than you have to.”

This concept is actually very helpful when you are matching to URI strings. For example, let’s say you had a string like this:

gibberish/foldertwo/folderthree

And you wanted to rewrite that gibberish to be something meaningful. But because it is gibberish, it looks different every time. You could write a RegEx like this:

.*/

But, it will match to the last slash because the RegEx are greedy. However, if you use minimal matching, like this:

.*?/

it will match to only the first slash. And that’s what minimal matching is all about.
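A quick sketch in Python puts the greedy and minimal versions side by side. It assumes Python's re module follows the same greedy and minimal matching rules as GA, and the replacement text "section/" is just a placeholder invented for the example.

```python
import re

# Greedy plus takes every a; the minimal version stops after the first one.
greedy_a = re.match(r"ba+", "baaaaa").group(0)
lazy_a = re.match(r"ba+?", "baaaaa").group(0)

uri = "gibberish/foldertwo/folderthree"

# Greedy .* runs to the last slash; minimal .*? stops at the first slash.
greedy_slash = re.match(r".*/", uri).group(0)
lazy_slash = re.match(r".*?/", uri).group(0)

# Rewriting only the gibberish part, using the minimal version.
rewritten = re.sub(r"^.*?/", "section/", uri)

print(greedy_a, lazy_a)
print(greedy_slash, lazy_slash)
print(rewritten)
```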

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now we will Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching
Lookahead 

Robbin

Regular Expressions for GA Bonus 1: {Braces}

Monday, June 25th, 2007

Before I start - The “Criticize GA Documentation” contest ends on Tuesday, June 26.

We now take a break in our regularly-scheduled programming (which was filters for GA). That’s because I need to return to an old topic, Regular Expressions (RegEx) for GA, and add a much-needed post: Regular Expression Braces.

Braces are curly brackets, like this {these are braces}. GA never mentions them. So, I don’t know if they are an unsupported feature, or a problem with the documentation.

Braces repeat the last “piece” of information a specific number of times. They are used with two numbers, like this: {6,8}. That particular example means, repeat the last piece of information at least six times and no more than eight.

For example, there is a place across the street here in Honolulu called the Rainbow Bazaar. If I wanted to pull a report with all the correct spellings of their name, I could search the report (in the little box at the bottom of the page in the new GA version). I would use the following RegEx:

baza{2,2}r

This means, pull all the keywords that have a baz followed by at least two and no more than two a’s and which are also followed by an r. Hence, bazaar. (Notice that the last letter is my last piece of information.) Or I could use those same braces to pull misspellings, a more interesting report.

The problem with regular expressions is always knowing what they are “working on.” In this case, what is the last piece of information? A set of square brackets or parentheses would make a piece of information. (And in fact, a great use of braces would be to capture all the IP addresses in a block of 0-255, like this: [0-9]{1,3} . It’s true that you will also capture 538 and 627 and all sorts of numbers above 255, but you really don’t care, since the IP block will never go higher than 255, anyway.) In the absence of a well-defined piece of information (defined by parentheses or brackets), you are working with the last character.
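Here is a small sketch of braces in Python, assuming Python's re module treats them the same way GA does. The keyword list is invented, and baza{1,3}r is just one possible way to widen the net for misspellings.

```python
import re

# Hypothetical keywords, made up for this example.
keywords = ["rainbow bazaar", "rainbow bazar", "rainbow bazaaar"]

# Exactly two a's: only the correct spelling.
exact = [k for k in keywords if re.search(r"baza{2,2}r", k)]

# One to three a's: the correct spelling plus some misspellings.
loose = [k for k in keywords if re.search(r"baza{1,3}r", k)]
misspelled = [k for k in loose if k not in exact]

# Braces after square brackets repeat the whole bracketed set.
octet = re.compile(r"^[0-9]{1,3}$")
octet_checks = [bool(octet.search(s)) for s in ["7", "255", "538", "1234"]]

print(exact, misspelled, octet_checks)
```

As the post says, 538 gets through the octet check even though it is not a valid IP number, while a four-digit number does not.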

Here are all the other RegEx posts:

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now we will Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
Minimal Matching
Lookahead

Robbin

Filters for GA, Part I: Get Ready with Profiles and Regex

Sunday, April 8th, 2007

I promised to write about Google Analytics (in this video). But first, I want to talk about profiles and Regular Expressions, because they will make your work so much easier.

Profiles. So you’re learning about filters, and you’ll probably make some mistakes. Join the crowd. But why make mistakes on the data that you’ve been using for a year now? Keep that “production data” holy, and experiment on a sandbox profile. Even if you think you are an expert at GA, always have at least one sandbox profile, and preferably two.

(Need to understand what profiles are? Well, certainly, you can use a profile within an account to measure a second website. But here, we aren’t talking about profiles for a new website, we’re talking about profiles for the same website. This is one of those concepts that is hard to understand at first, but is trivial once you get it. The idea is, you have multiple copies of your web analytics, all measuring the same thing, and if you set them up exactly the same, they will look exactly the same. However, you don’t have to set them up the same — you can keep one as your “good” copy, and the others can be used to learn. Need to learn how to configure a second profile?)

Having two clean (i.e. no filters) sandbox profiles will help you in a variety of ways: First, you don’t need to worry that the other filters on that profile are messing you up somehow. Second, they both start (one with and the other without the filter) at the same time, so when you write me and ask me why your filter doesn’t work, I promise I won’t ask if you chose a time period that pulled in unfiltered data. Third, since you won’t have taken yourself out of the data (because most people use filters for that, all except those who build special cookie workarounds), you can test it yourself doing all the strange things you’d like to check out.

Regular Expressions. Most filters require regular expressions. Now that I’ve gone through fourteen posts on Regular Expressions (RegEx) for Regular People (and specifically, for GA), I will be referencing that data. And if you already know it, you’ll think that this filter stuff is easy, easy.

Robbin

Regular Expressions question and GA: Search/Replace

Tuesday, March 20th, 2007

This weekend, someone sent me a Google Analytics Regular Expression (RegEx) question. The answer is pretty basic but interesting, and there is something to be learned about one of my favorite tools, the Epikone RegEx Tester.

Q: Hi, I’ve read most of your posts about RegEx, but I still can’t manage to find the right RegEx for one of my filters in GA.

I’d like to use a “search and replace” filter for all the pages whose URLs are either / OR /index.asp (which are in fact: www.my-domain.com and www.my-domain.com/index.asp). Basically, I’d like to have all the pages with both URLs displayed as “the page name I gave” in GA reports. (This is why he wants to use the search and replace filter - to give the pages his chosen name. – Robbin)

I have tried several expressions on the RegEx filter tester but none of those seem to work. (Note to Epikone: Notice that your tool is now elevated to “the” tester of choice. – Robbin)

I tried this one below, but I’m not sure that what the RegEx filter tester tells me means the filter is correct or not (I don’t fully understand how this tool works, especially for the “input string” and “result” fields). Here is the RegEx he is interested in:

^(/|/index\.asp)$

When I enter / in the input string, then click submit, the displayed result is Match: /,/

When I enter /index.asp in the input string, then click submit, the displayed result is Match: /index.asp,/index.asp

I don’t know what this result does mean exactly.

Could you tell me if this RegEx (^(/|/index\.asp)$) is correct regarding what I’m after, or if it’s wrong and then could you suggest me a working one ?

And here is my answer:

Robbin: Why don’t you first change your default page to be just index.asp? You can do this in settings > edit > then edit again. Telling GA that your default page is index.asp will stop you from getting a page like this / . This will help you with the search and replace AND help you read your analytics more easily. Then you can do it the simpleton’s way: ^/index\.asp (You really don’t need the dollar sign unless you have URLs that end in aspx, for example.)

I think if I were wanting to keep both / and /index.asp (a bad idea), my regex would be ^/(index\.asp)?$

It is really the same as yours, just a little simpler and easier to read.

The reason that the Epikone RegEx tester acts the way it does when you write it with parentheses is that parentheses tell GA, “I’ve created a variable.” And here, you can read what Justin the Man said about their RegEx tester and creating variables. I found this in an old email from him:

Justin’s email: “Why our reg ex tester behaves the way it does. Our tester is pretty smart. If your expression matches the input string, then the tester will return the word ‘Match’ along with the part of the string that the expression matched. Now, if you are using parenthesis to store some part of the expression in a variable, the tool will return the value stored in the variable in addition to the part of the string that the reg ex matches.”

There is at least one other way to do this, too. You could go into the part of the code that reads urchinTracker(), on the homepage and make it urchinTracker('homepage').

In the process of writing this, I found that there is a whole piece on the Epikone blog about how to interpret their results.

Robbin

Regular Expressions for Google Analytics: OK, I did it

Friday, February 23rd, 2007

I was finally inspired by Alan, who reads this blog using his Blackberry on the Paris Metro. (Truly a world wide web.) Also inspiring were the comments of an anonymous poster, who wrote that all the Regular Expressions (RegEx) posts were incredibly helpful, but couldn’t I please make them easier to access? So I did it, they are all beautifully threaded now. I even fixed awful typos in the Summary/Intro post and the ^ post (and you know, typos are the worst when you are a newbie learning a technical topic, you can spend hours trying to understand a topic, only to finally learn that the author was just lazy.)

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching
Lookahead

Robbin

Intro to GA Regular Expressions: Part XIV of XIV

Sunday, January 28th, 2007

This is the last of fourteen posts I have done on Regular Expressions for Google Analytics. Now that I have learned them (and hopefully explained them), it’s time to have that introductory post (I always do like to work backwards.)

Here’s the reason. People skip the introductions to books, they don’t read manuals, they just want to figure out how to make “it” work, whatever it is.

Only once it works are we ready to say, OK, so what? Why do I care, what’s it good for, what’s it bad for?

What are Regular Expressions (RegEx)? They use characters on your keyboard (like * and ^), enabling you to create an expression that may or may not match a target expression. They have a strict set of rules (just like a programming language would), and it’s easy to make mistakes with them. (This is why I am a big user of this RegEx checking tool. ) So you will always have at least two expressions, the Regular Expression with the funny characters and the expression you are matching on your site, or in someone’s address, or keyword. Here’s a quick example of the Regular Expression vs. target expression issue: I can create a regular expression like this luna|robb?in and then match it against the keywords people used to come to my site to filter out all the times people used my company name or my own name, whether they spelled it right or not. In this example, the keywords were my target expressions. (Need to understand that pipe symbol in the RegEx? Need to understand the question mark in the RegEx?)
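To make the Regular Expression vs. target expression idea concrete, here is a minimal sketch in Python. It assumes Python's re module reads the pipe and the question mark the same way GA does, and the keywords are invented for illustration.

```python
import re

# The regular expression from the example above.
branded = re.compile(r"luna|robb?in")

# Hypothetical target expressions (keywords), made up for this example.
keywords = ["lunametrics blog", "robin analytics", "robbin regex", "regex tutorial"]

branded_hits = [k for k in keywords if branded.search(k)]
non_branded = [k for k in keywords if not branded.search(k)]

print(branded_hits)
print(non_branded)
```

The pipe means either side can match, and b? makes the second b optional, so both robin and robbin are caught.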

So why use them? The first reason that RegEx are worth caring about — if you use Google Analytics — is that Google cares. There are certain tasks Google just won’t let you do correctly without using RegEx. Examples that come quickly to mind are: take yourself out of the data using an IP address, creating a custom filter, creating a filter that enables you to see both your subdomain and your domain in the same profile. (The latter is just an example of the second example, a custom filter, but I mention it because you can read about it in the GA help section.)

Other great examples of custom filters: Create a filter to learn what words people actually type in to Google before they click on your AdWord, instead of just learning which AdWord gets credit. (I use this one all the time. The only hack I like better is this one.) Force all your reports to give you pages by title instead of URL.

OK, so we understand that they are needed for filters. But how about goals? After all, you can do a head match or an exact match, why go to the trouble of using a RegEx match?

One reason would be if you have two pages that are essentially the same goal or the same place in the funnel. So, for example, let’s say that when the visitor reaches either of these two pages, www.mysite.com/folder3/thanks.html and www.mysite.com/thanksalot.html, he has really achieved the same goal. By using RegEx, you can make your goal page in Google Analytics /thanks and whenever someone reaches either of those pages, the same goal (G1, or G2, or whichever one you choose) is incremented. Then if you happen to care about which page actually matters the most, you can easily go to the Content Optimization > Goals and Funnel Verification > Goal verification to see which page mattered the most.

Of course, if you have other pages on your site that match /thanks, you have to get more specific with your RegEx. However, I never forget the lesson that a friend taught me: keep your RegEx as simple as possible.
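Here is a rough sketch in Python of the goal example and of what "getting more specific" could look like. The third page path and the stricter expression are assumptions made up for illustration, and the sketch assumes the goal is compared against the path part of the URL.

```python
import re

# Hypothetical page paths; the third one is an unrelated page that happens to contain /thanks.
pages = ["/folder3/thanks.html", "/thanksalot.html", "/blog/thanksgiving-sale.html"]

# The simple goal expression matches all three.
simple = re.compile(r"/thanks")
simple_hits = [p for p in pages if simple.search(p)]

# One possible stricter expression that keeps only the two real goal pages.
specific = re.compile(r"^/(folder3/thanks|thanksalot)\.html$")
specific_hits = [p for p in pages if specific.search(p)]

print(simple_hits)
print(specific_hits)
```

If the simple version only ever matches your goal pages, it is the one to keep.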

Another reason to use RegEx is when you are lazy. After all, just because there are a ton of variations, who wants to create a ton of iterations? The example above shows how to combine two into one, but what if you had 15 variations? Example: you have to be sure that your company is not in the analytics. Your company owns all the IP addresses from 72.77.12.26 through 72.77.12.40, inclusive. You sure don’t want to create 15 filters. Instead, you can use a regular expression like this: ^72\.77\.12\.(2[6-9]|3[0-9]|40) — it will capture all fifteen IP addresses. (The little carat ^ at the beginning says, don’t match if there is something before the 72. The backslashes \ turn the special dots into plain dots. This 2[6-9] means, match 26, 27, 28 and 29. This 3[0-9] says, match 30, 31, etc through 39. 40 says, match 40. The pipes are OR signs. So at the end of the expression, you’ll be matching to 26-29, OR 30-39 OR 40.)
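You can check the IP expression against every possible last number with a short sketch in Python, assuming Python's re module reads the expression the same way GA does.

```python
import re

# The expression from the paragraph above.
company = re.compile(r"^72\.77\.12\.(2[6-9]|3[0-9]|40)")

# Try every valid last number, 0 through 255, and keep the ones that match.
matched = [n for n in range(0, 256) if company.search(f"72.77.12.{n}")]

print(matched)
print(len(matched))
```

Only 26 through 40 come back. The expression has no dollar sign at the end, so a string like 72.77.12.400 would also match, but that is not a valid IP address, so it should not turn up in practice.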

Well, that’s it on RegEx for now. Here are all the prior posts in the series:

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

Now that I am done with this series, I’ll go back and make sure that all the posts link to all the other posts consistently. Done! Done! All done!

Robbin
LunaMetrics