
Archive for the ‘Regular Expressions’ Category

Regular Expressions Part XIII: Good Greed

Thursday, January 11th, 2007

This is my next to last post in this Regular Expression (RegEx) series. I have been thinking about this post for a long time and yesterday someone asked me a question (which finally got me to write this). She wrote that she had two pages that she wanted to roll into one Google Analytics goal. She created the Regular Expression for it, ran it through Epikone’s RegEx Coach, and it worked — but it wasn’t working in GA. (More on the Coach below.)

The two pages were:

subdomain.mysite.com/folder/subfolder/GoalThree.php
subdomain.mysite.com/folder/subfolder/GoalThreesome.php

She sent me a long, complicated expression which wasn’t working for her and asked my opinion.

This is absolutely a case of putting Good Greed to work for you, as we will see in a minute. As I wrote in my last post, Regular Expressions are very greedy: they match everything they can unless you tell them not to. This is a very hard concept to wrap your head around. It means that, among other things, whatever comes before the expression and whatever comes after it is allowed to be anything at all (unless you tell the RegEx otherwise, or there is nothing there to match).

Anyway, I wrote her back and said, why don’t you just write an expression like this:

/folder/subfolder/GoalThree

This assumes that she doesn’t have other GoalThreeVersions that will be incorrectly mixed in here. If, for example, she had another page, /folder/subfolder/GoalThreeCornered, that would qualify as a match too (because the RegEx matches everything it can, even if those characters aren’t in the Regular Expression.) Moving back to how simple her RegEx might be, she might even have been able to get away with a goal like this, depending on her site:

/GoalThree

This matches every URL that includes /GoalThree.
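To see Good Greed at work, here is a minimal sketch in Python. Python's re module is only standing in for Google Analytics here (an assumption; GA has its own matcher), and the GoalFour page is made up for contrast. re.search is used because it looks for the pattern anywhere in the string, which is the behavior described above.

```python
import re

# The first three pages come from the post; GoalFour is a hypothetical extra.
pages = [
    "/folder/subfolder/GoalThree.php",
    "/folder/subfolder/GoalThreesome.php",
    "/folder/subfolder/GoalThreeCornered",
    "/folder/subfolder/GoalFour.php",
]

goal = re.compile(r"/GoalThree")

# No anchors, so anything may come before or after "/GoalThree".
matched = [p for p in pages if goal.search(p)]
print(matched)
```

Both goal pages come back, and so does GoalThreeCornered, which is exactly the caveat about other GoalThree versions.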

Finally, a word about the Epikone RegEx Coach. I haven’t talked to Justin about this, but I am fairly sure that the Coach is configured to check whether the phrase you type is a match to the RegEx you type, using the way GA interprets RegEx. That doesn’t mean that you necessarily come up with a valid goal, or an IP address expression that will actually filter anything. For example, you might use it to see if colou?r is a valid RegEx for color and for colour (it should be), but that doesn’t mean colou?r will necessarily work in your Google Analytics profile filters or goals. You really have to understand the context in which you are using the expression and what GA demands of you, in addition to getting the expression and your test phrase to match each other.

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

Robbin
LunaMetrics

Regular Expressions Part XII: Bad Greed

Saturday, December 2nd, 2006

Now that I have learned and explained the Regular Expressions that Google Analytics uses:

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

I want to explore another area: Regular Expressions and the concept of greediness.

You might be tempted to write a Regular Expression like this:

/mypage/

expecting it to match the page on your site called /mypage/

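And this Regular Expression really does match /mypage/. But it also matches /mypage/thirdpage-and-something-else . For that matter, it matches /secondpage/mypage/ and /secondpage/mypage/anything.htm, because nothing says the match has to start at the beginning. (It won’t match /mypage.htm or /mypage.asp, though: the expression insists on a slash right after mypage.)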
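A quick way to check which strings an unanchored expression picks up is to try it. The sketch below uses Python's re module as a stand-in for GA (an assumption), and the candidate URIs are illustrative. Note that a URI with no slash after mypage comes back False, because the slash is part of the expression.

```python
import re

pattern = re.compile(r"/mypage/")

candidates = [
    "/mypage/",
    "/mypage/thirdpage-and-something-else",
    "/secondpage/mypage/",
    "/mypage.htm",  # no slash after "mypage"
]

# re.search finds the pattern anywhere in the string.
results = {uri: bool(pattern.search(uri)) for uri in candidates}
for uri, hit in results.items():
    print(uri, hit)
```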

Regular Expressions are greedy — they match and match as much as they can. Greed can be good, but first I want to write about the obvious problem, i.e. the RegEx (Regular Expression) will match too many strings to be useful.

We can deal with this in various ways:

1) Tell the RegEx where to start. In the above example, if we wrote the RegEx like this

^/mypage/

it will only match when /mypage/ is at the beginning of the line, so it will never match /secondpage/mypage/ etc.

2) Tell the RegEx when to stop. We can do this in various ways, but need to know how the expression ends. For example, are we only looking for mypage.htm or are we also looking for all the pages that are in the /mypage/ folder — /mypage/otherpages.htm? If only mypage.htm matters, then we can include that in the RegEx:

/mypage\.htm$

Notice that I used a backslash to make the dot into a real dot and not a special character, and a dollar sign to say, this only works at the end of the line. (That way, mypage.html won’t match.)

We can combine that with #1 above and create this RegEx:

^/mypage\.htm$

and never get any unexpected characters before the slash or after the htm.
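Here is a small sketch of the two anchors working together, again with Python's re module standing in for GA (an assumption) and made-up test URIs. The expected answers are written next to each URI so the comparison at the end checks them.

```python
import re

anchored = re.compile(r"^/mypage\.htm$")

# URI -> whether we expect a match
expected = {
    "/mypage.htm": True,
    "/mypage.html": False,            # extra "l" after htm, blocked by $
    "/secondpage/mypage.htm": False,  # extra folder in front, blocked by ^
    "/mypageXhtm": False,             # the backslash makes the dot a real dot
}

outcome = {uri: bool(anchored.search(uri)) for uri in expected}
print(outcome == expected)
```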

On the other hand, if all the pages in the /mypage/ folder that have .htm suffixes are of interest, we could do this differently:

^/mypage/.*\.htm$

I really hate when people throw this kind of gobbledygook at me so let me see if I can explain in pieces:

^/mypage/ = only consider the match if /mypage/ is at the start of a line…
.* = match everything that comes next until…
\.htm$ = you get to the last real period followed by htm and it’s at the end of a line.
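The three pieces can be tried out together in a short sketch. Python's re module is a stand-in for GA (an assumption), the URIs are invented, and the parentheses around .* are an addition used only to show what the wildcard grabbed.

```python
import re

# Same expression as above; the parentheses only capture what .* matched.
folder = re.compile(r"^/mypage/(.*)\.htm$")

uris = [
    "/mypage/otherpages.htm",
    "/mypage/sub/deep.htm",
    "/mypage/otherpages.html",
    "/secondpage/mypage/otherpages.htm",
]
matches = {u: bool(folder.search(u)) for u in uris}
print(matches)

# With two ".htm" pieces in the name, .* runs up to the LAST real period.
grabbed = folder.search("/mypage/a.htm.b.htm").group(1)
print(grabbed)
```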

Clear as mud, eh?

Robbin
LunaMetrics

Regular Expressions Part XII: Now Let’s Practice

Monday, November 27th, 2006

Now that I have learned and then explained all the Regular Expressions for Google Analytics:

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

let’s work backwards, i.e. look at some expressions and figure out what they mean and why.

I always hate when techies give a really simple explanation and then jump to the hardest example possible, so I will try not to do the same (which is easy for me, not being a techie and all.) Let’s start with these warm-up examples from the Wikipedia entry on Regular Expressions.

  • “.at” matches any three-character string that ends in “at”, like hat, cat or bat. Reason: Because a dot matches any character. So hat, cat and bat are all good matches, as would be any other single character in place of the dot.
  • “[hc]at” matches hat and cat. Reason: Because square brackets create a list of items, and you can match to any one item in the list. So this expression matches hat by pulling the “h” out of the square brackets, and it matches cat by pulling the “c” out of the square brackets. Unlike the former example, it doesn’t match bat, because there is no “b” in the square brackets.
  • “[^b]at” matches all the matched strings from the regex “.at” except bat. Reason: This is an alternative use of the caret ^. When it is inside square brackets at the beginning, it means “not.” Thus, the [^b] means, don’t match a b.
  • “^[hc]at” matches hat and cat but only at the beginning of a line. Reason: This is the more standard use of the caret ^. It is not inside square brackets, so it means the RegEx will match your expression only if your expression starts at the beginning of the line.
  • “[hc]at$” matches hat and cat but only at the end of a line. Reason: This is identical to the second example in this list, except for the dollar sign at the end. The dollar sign ensures that the RegEx only matches your string if your string’s characters come at the end of a line.
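The five warm-ups can be checked with a few lines of Python. This is only a sketch: Python's re module stands in for whatever engine you are using (an assumption), and the helper name hits and the two test phrases are made up for illustration.

```python
import re

def hits(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [c for c in candidates if re.search(pattern, c)]

words = ["hat", "cat", "bat"]
phrases = ["hat trick", "top hat"]

dot = hits(r".at", words)              # dot matches any one character
either = hits(r"[hc]at", words)        # h or c, but not b
not_b = hits(r"[^b]at", words)         # anything except b
at_start = hits(r"^[hc]at", phrases)   # only at the beginning of the line
at_end = hits(r"[hc]at$", phrases)     # only at the end of the line

print(dot, either, not_b, at_start, at_end)
```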

OK, here is a slightly harder one, also from Wikipedia:

((great )*grand)?((fa|mo)ther)

I will take this apart to make it easier to understand.

The parentheses create groups, separated by a question mark. So we effectively have:

(expression in this set of parentheses)?(another expression in this set of parentheses)

Since a question mark means, include 0 or 1 of the former expression, we know that this RegEx is allowed to match just the stuff in the second set of parentheses. (Right? That’s what question marks do: they can match what comes right before them or not match what comes right before them. If they don’t match the stuff before them, only the characters after them are left to match.) So, let’s start by looking at the second half only, which we know should be able to stand by itself:

((fa|mo)ther)

The pipe symbol | means OR. So this resolves to (father) OR (mother). You might reasonably ask, why do we need all the parentheses? Technically, we don’t need the outside set but they make the expression easier to read when it is all together like this: ((great )*grand)?((fa|mo)ther) It would be perfectly reasonable to write an expression like this: (fa|mo)ther. We do need the inside set because if we got rid of them, the expression would look like this: fa|mother , which means, either fa OR mother.

Now let’s go back and look at the first half, the part that came before the question mark:

((great )*grand)

The star tells us to match zero, one or more than one instances of the expression before it. So it can match a string which doesn’t include great, in which case we just have grand, and of course, we always have the end of the expression, which will either be mother or father. So we might match to grandmother or grandfather. It can match a string which includes great just once, in which case we have great grandmother OR great grandfather. And it can match a string which includes great more than once, so we might end up with great great great great grandmother OR great great grandfather.

So there you have it, all your ancestors with just one Regular Expression.
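To try the ancestors expression yourself, here is a minimal sketch using Python's re module (an assumption, not the tool from the post). It uses re.fullmatch, which insists that the whole string fit the pattern; with a plain search, something like "great grand mother" would still count because it contains "mother" on its own. The sample strings are made up.

```python
import re

ancestor = re.compile(r"((great )*grand)?((fa|mo)ther)")

samples = [
    "mother",
    "grandfather",
    "great grandmother",
    "great great great great grandmother",
    "great grand mother",   # stray space before "mother"
    "grandbrother",
]
results = {s: bool(ancestor.fullmatch(s)) for s in samples}
for s, ok in results.items():
    print(s, ok)

# Without the inner parentheses the pipe splits the whole expression:
# "fa" OR "mother", so "father" no longer fits.
bare_pipe_father = bool(re.fullmatch(r"fa|mother", "father"))
bare_pipe_fa = bool(re.fullmatch(r"fa|mother", "fa"))
```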

If you didn’t understand any of that, please send me email, steif -at- lunametrics.com. (I am always disappointed that no one ever comments on RegEx posts.)

Robbin
LunaMetrics

Regular Expressions Part XI: Real Wildcards .*

Monday, November 20th, 2006

Now we are (I am) ready for a Google Analytics Regular Expression that is truly a wildcard .*

Months ago, I wrote a blog post about Regular Expressions Wildcards for Google Analytics. But when I went back to it, it was only semi-intelligible, so I deleted it and created all the Regular Expression building blocks first. If you like, you can read all of them, stretching out over a year:

Backslashes \
Dots .
Carats ^
Dollars signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets []and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching
Lookahead

Now that you (or perhaps more correctly, I) understand the building blocks, let’s talk about how to create real wildcards.

Most of us are familiar with a star as a wildcard, outside of Regular Expressions. We can search for all our .jpg files on our computer with this: *.jpg, which to us means “get everything.jpg.” However, with Regular Expressions, a star only means repeat the last character zero times or once or more than once. In order to make it mean “get everything,” you have to pair it with a dot, like so: .*

Why? Because, a dot means get any character. A star means, repeat the last character zero times or once or more than once. So the combination means, repeat any characters as often as you like, i.e. get everything.

If we wanted to get every occurrence of a jpg file, we would do it with a RegEx that looked like this:

.*\.jpg

For those of you who are scratching your heads instead of nodding them, here is why: .* tells Google Analytics to match everything (as described above). The next part of the expression \. tells GA to then match a real dot. This is because dots are usually wildcards in their own right, but using a backslash turns them into ordinary dots. The last three characters, jpg, tell GA to match the letters jpg. So we end up with “everything.jpg,” which was just what we wanted.
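Here is the wildcard expression run against a handful of invented file names, as a sketch only. Python's re module stands in for GA (an assumption). The last file name is a reminder that without a dollar sign at the end, extra characters after jpg are still allowed.

```python
import re

jpg = re.compile(r".*\.jpg")

files = [
    "photo.jpg",
    "/images/2006/logo.jpg",
    "photo.jpeg",      # "e" where the "g" should be
    "photojpg",        # no real dot
    "notes.txt",
    "photo.jpg.bak",   # matches: nothing says the string must end at jpg
]
found = [f for f in files if jpg.search(f)]
print(found)
```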

Robbin
LunaMetrics

Many thanks to Justin and his awesome RegEx Tool (which doesn’t require a download.) Postscript: And of course, thanks to Steve, who taught me Regular Expressions from the beginning and found an error in this original post.

Regular Expressions Part X: Stars *

Wednesday, November 15th, 2006

This is Part X of the long long series I have been doing on Regular Expressions (RegEx) for Google Analytics. It is the last one I will do that explains what Google says vs. what they mean.

When it comes to stars (or call them asterisks if you like), Google Analytics says this:

* Match zero or more of the previous items

Perfectly reasonable, if you know how to create a list of previous items. If you already read Post IX, use of the plus sign in RegEx, this will be easy, and if not, I’ll try to make it easy.

If the only special character you are using is the star *, then the previous item is defined as the previous character. For example, let’s say that my company has five digit part numbers, and I want to know how many people are searching for part number 34. The problem I have is all those leading zeros: technically, the part number is PN00034. So I could use the little Google Analytics filter box in my search report with a RegEx like this: PN0*34. That will bring me back all the searches for PN034 and PN0034 and PN00034 and PN00000034 and for that matter, PN34, since using the star means that the previous item doesn’t need to be in the search (zero or more of the previous items, it says).
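The part-number example is easy to try out. This sketch uses Python's re module in place of the GA filter box (an assumption), and the search strings other than the PN... ones named above are invented.

```python
import re

part = re.compile(r"PN0*34")

searches = [
    "PN34",        # zero zeros is fine
    "PN034",
    "PN00034",
    "PN00000034",
    "PN0134",      # a 1 interrupts the run of zeros
    "PN00035",     # wrong last digit
]
part_hits = [s for s in searches if part.search(s)]
print(part_hits)
```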

Alternatively, we could build a list of previous items using square brackets. As in my post on plus signs, I had a hard time finding a reason someone would want to use this, so again I used the example that Steve gave me. His example was square brackets with a space. So, I could do a search for my company name in the same filter box on the keywords report, like so: Luna[ ]*metrics. That will come back with LunaMetrics (no use of the space), or Luna Metrics (one space), or Luna  Metrics (two spaces), etc.

For the sake of completeness, I should point out that you can put real characters in the square brackets like this: b[aeiou]*d, and it matches bad and bed and bid and bod and bud. But for that matter, it matches baaaad and boud and bd, so I don’t think it is particularly useful. If I really just wanted to see those five examples (bad, bed, bid, bod and bud), I would be smarter to use the OR pipe | and do it like this: b(a|e|i|o|u)d.
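Both star-with-brackets examples can be checked in a few lines. Treat this as a sketch: Python's re module stands in for GA, and the IGNORECASE flag is an assumption made because the expression has a lowercase m while the example keywords are capitalized. The hyphenated keyword is invented as a non-match.

```python
import re

# IGNORECASE is an assumption about how the filter box treats capitals.
name = re.compile(r"Luna[ ]*metrics", re.IGNORECASE)
keywords = ["LunaMetrics", "Luna Metrics", "Luna  Metrics", "Luna-Metrics"]
name_hits = [k for k in keywords if name.search(k)]

words = ["bad", "bed", "bid", "bod", "bud", "baaaad", "boud", "bd"]
# fullmatch means the whole word has to fit the expression.
loose = [w for w in words if re.fullmatch(r"b[aeiou]*d", w)]
strict = [w for w in words if re.fullmatch(r"b(a|e|i|o|u)d", w)]

print(name_hits)
print(loose)
print(strict)
```

The star version lets through every word in the list, while the pipe version keeps only the five with exactly one vowel.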

Anyone who has a great example of using a star with square brackets is strongly encouraged to comment.

Robbin
LunaMetrics

Regular Expressions Part IX: Plus signs +

Thursday, November 9th, 2006

This is part nine of a multi-part series I am doing to make Regular Expressions (“RegEx”) more understandable for users of Google Analytics. I am learning and teaching at the same time.

Today, I am writing about the plus sign, +. Here is how GA defines the plus sign:

+ Match one or more of the previous items

This probably seems perfectly reasonable to Google and other old hands at Regular Expressions, because they already know how to define “the Previous Items.” But I didn’t and had to figure it out.

The simplest meaning of “Previous Items” is “previous character.” So, I could look for my name in my Google Analytics search terms by typing this into the “quick filter” box: Rob+in. That will return Robin or Robbin, and for that matter, Robbbbin. It’s actually pretty useful, since I get a lot of searches like that, and often want to filter them out.
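As a quick check of the plus sign on a single character, here is a sketch with Python's re module standing in for the quick filter box (an assumption). The last two names are invented non-matches.

```python
import re

robin = re.compile(r"Rob+in")

names = ["Robin", "Robbin", "Robbbbin", "Roin", "Robyn"]
# The plus needs at least one "b", so "Roin" is out.
plus_one = [n for n in names if robin.search(n)]
print(plus_one)
```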

Alternatively, you can build a list of Previous Items by using square brackets. Like this: [abc]+, which will return a, ab, cab, c, b, bbbb and the like. This seems a little strange, but in fact, if you read the interpretation (“match one or more of the previous items”), you’ll notice there isn’t anything about the previous items being in a specific order.

However, I couldn’t think of any way to use square brackets with a plus sign that I couldn’t achieve with other Regular Expressions. So I wrote my tutor in Australia, Steve. The best example he had was searching for a space. Thus: if I wanted to know how many people type in web site with one space, or two spaces, or even three, I could create web[ ]+site for my Regular Expression.
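Here is a sketch of both square-bracket examples, with Python's re module as a stand-in for GA (an assumption). The empty string and "abd" are invented to show what [abc]+ refuses, and the star version is included only for contrast with the plus.

```python
import re

strings = ["a", "ab", "cab", "c", "b", "bbbb", "", "abd"]
# fullmatch: every character must come from the list, and there must be at least one.
abc_hits = [s for s in strings if re.fullmatch(r"[abc]+", s)]

phrases = ["website", "web site", "web  site", "web   site"]
plus_hits = [p for p in phrases if re.search(r"web[ ]+site", p)]  # one or more spaces
star_hits = [p for p in phrases if re.search(r"web[ ]*site", p)]  # zero or more spaces

print(abc_hits)
print(plus_hits)
print(star_hits)
```

The plus version skips "website" because it demands at least one space; the star version accepts all four.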

If you have any other great uses of square brackets and plus signs for Google Analytics, please share them.

Robbin
LunaMetrics

Regular Expression posts made easier

Saturday, October 28th, 2006

I’ve been writing about Regular Expressions for Google Analytics for some time now. The more I learned, the more I wanted to rewrite my very earliest posts, because In The Beginning, I took easy topics and made them hard. Or, I combined too many expressions together without just starting with basic ideas.

Anyway, I rewrote Part I, the backslash and I also rewrote Part II. Originally, Part II addressed multiple wildcards but I simplified it to be just the dot. I will deal with the plus sign + and the asterisk (which RegEx types like to call a star) in future posts.

Over time, I will get them all cross-indexed. (Done, done done!! at last. ) When I change the post title, I break the link, so I’ll fix that too. (sorry!)

If you got lost learning about wildcards on Post II, this would be a good time, to the extent that you actually have time, to go back and just learn about dots.

Robbin
LunaMetrics

Regular Expressions Part VIII: [Square Brackets] and Dashes -

Sunday, October 22nd, 2006

Come learn Regular Expressions for Google Analytics with me. I am learning Regular Expressions for Google Analytics and teaching with each lesson. This is why I roll them out slowly - each expression requires a lot of research. I have been awed at this process because the explanations are so opaque before I understand them, and once I learn them, they make perfect sense. Tonight, let’s talk about square brackets, and I hope you’ll see what I mean.

Google Analytics defines square brackets like this:

[] Match one item in this list

This is exactly what they mean, it just sounds hard because they don’t tell you how to create the list and how to define an item. Simple explanation: When you use square brackets, each character within the bracket is an item. Look at this sample list with five items in it, each of which happens to be a vowel: [aeiou]. The hard part is understanding that you don’t need anything to separate the characters, and that each item in the list is only one character.

Here’s how someone might use square brackets with Google Analytics. Let’s say you were selling items with part numbers formatted like this: PART1, PART2, etc. You want to know how often someone lands on your site by typing the actual part number into a search engine, but you only care about PART3, PART5 and PART7. So, you could enter PART[357] into the filter box on the top of your Overall Keyword Conversion report (for example). That will match each of those part numbers. (Technically, it matches one of these three and more, but I will hold that problem/opportunity for a different post.)
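The part-number list can be tried out like this. It is a sketch with Python's re module in place of the GA filter box (an assumption); PART35 is an invented extra that shows the "and more" caveat, since the expression is happy to have characters after the matched digit.

```python
import re

# PART1 through PART9, plus one longer part number for contrast.
parts = [f"PART{n}" for n in range(1, 10)] + ["PART35"]

wanted = [p for p in parts if re.search(r"PART[357]", p)]
print(wanted)
```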

It’s helpful to understand dashes so that you can use square brackets easily. Google Analytics defines dashes like this:

- Create a range in a list

That means, instead of creating a list like this [abcdefghijkl], you can create it like this: [a-l], and it means the same thing — only one letter out of the list gets matched. You can also combine the range method and the brute force, type-them-all-in method and create a list like this: [a-lqtz], which matches any one letter between a and l, or q, or t, or z.

Special case: Sometimes — perhaps often — we really want the dash to be one of the characters we are searching for. Maybe we want to see searches of luna-metrics and lunarmetrics and lunammetrics. In that case, we put the dash at the beginning or end of the list, like this [-rm]. That means that the full RegEx which would match the three lunametrics keywords above would be luna[-rm]metrics. This is because the phrase will start with luna, end with metrics, and in between will have a dash, an r, or an m. Those are the only choices in the little list I created, the one that looked like this: [-rm].
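Ranges and the leading dash can both be checked quickly. In this sketch Python's re module stands in for GA (an assumption), and the last two keywords are invented non-matches.

```python
import re
import string

letters = string.ascii_lowercase

# The typed-out list and the range pick exactly the same letters.
long_form = [ch for ch in letters if re.fullmatch(r"[abcdefghijkl]", ch)]
short_form = [ch for ch in letters if re.fullmatch(r"[a-l]", ch)]

# Range plus individually typed letters.
mixed = "".join(ch for ch in letters if re.fullmatch(r"[a-lqtz]", ch))

# A dash at the start of the list is a plain dash, not a range.
keywords = ["luna-metrics", "lunarmetrics", "lunammetrics", "lunametrics", "luna metrics"]
dash_hits = [k for k in keywords if re.search(r"luna[-rm]metrics", k)]

print(long_form == short_form, mixed, dash_hits)
```

Plain "lunametrics" is left out because the list has to use up one character (a dash, an r, or an m) between luna and metrics.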

There are other interesting things that you can do with square brackets, but I am leaving them out for now, either because they don’t all work with Google Analytics, or because I think this is enough for today. (Correct me if I’m wrong!)

Robbin
LunaMetrics

Regular Expressions Part VII: (Parentheses)

Thursday, October 12th, 2006

As promised, here is installment VII of my Regular Expressions (RegEx) tutorial: parentheses. I am learning and sharing at the same time. I am only learning them to use for Google Analytics.

I wanted to get this one out soon after my last RegEx post, because the last one was on the use of pipes, which stand for OR in Regular Expressions. Pipes (OR symbols) and parentheses often go together.

My tutor, Steve in Australia, does a really good job of explaining parentheses. In the same way that this mathematical statement,

6*(2+3)

is equivalent to 6*2 plus 6*3, parentheses in Regular Expressions make sure that the stuff outside of the parentheses gets applied equally to each of the alternatives inside the parentheses.

For example — and remembering that the pipe symbol | stands for OR — we can have a regular expression like this:

grand(mother|father)

That will match either grandmother or grandfather.

Or, here is another, similar but not identical example:

Ste(ph|v)en

That will match either Stephen or Steven.
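Here is a sketch of what the grouping buys you, using Python's re module as a stand-in for GA (an assumption). The third pattern, without parentheses, is added only for contrast, and grandson and Stefen are invented non-matches. fullmatch is used so the whole word has to fit.

```python
import re

grand = re.compile(r"grand(mother|father)")
steven = re.compile(r"Ste(ph|v)en")
no_group = re.compile(r"grandmother|father")  # the pipe now splits the whole expression

grand_hits = [w for w in ["grandmother", "grandfather", "grandson"] if grand.fullmatch(w)]
steven_hits = [w for w in ["Stephen", "Steven", "Stefen"] if steven.fullmatch(w)]

# Without the parentheses, "grand" belongs only to the first alternative.
plain_father = bool(no_group.fullmatch("father"))
grand_father = bool(no_group.fullmatch("grandfather"))

print(grand_hits, steven_hits, plain_father, grand_father)
```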

What if the two terms are really different and there isn’t much in the way of grouping to do? For example, what if we want to filter out Robbin or Luna (which I do all the time in my GA)? Then we can go back to the last lesson on OR and just use a simple pipe:

Robbin|Luna

(Often, even people who know me well misspell my name, so I could use what I learned in lesson V, question marks, to make the second “b” optional, like this: Robb?in|Luna)
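Filtering out a name with a simple pipe might look like the sketch below. Python's re module stands in for the GA filter (an assumption), and the list of search terms is made up.

```python
import re

me = re.compile(r"Robb?in|Luna")

searches = ["Robin", "Robbin", "LunaMetrics blog", "web analytics", "regex tutorial"]

# Filtering OUT: keep only the terms the expression does not find.
kept = [s for s in searches if not me.search(s)]
print(kept)
```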

In Google Analytics (I won’t speak to other languages) we don’t need to use any parentheses if there isn’t any grouping; the pipe can stand on its own. Or as Justin always tells me, keep it simple.

[Incredibly techie addition: My last comment about never needing parentheses when there is nothing outside the parentheses is not always true. At the eMetrics Summit, Nick from Google and Justin from Epikone taught me a lot about creating custom filters and during that process, explained how parentheses define a variable. I will revisit this topic later.]

Robbin
LunaMetrics

Regular Expressions Part VI: OR

Wednesday, October 4th, 2006

This is the sixth installment of my Regular Expression lessons. I am actually learning more than teaching and just sharing as I go along. These are Regular Expressions for Regular People (c), so all the tech-talk is removed. My motivation for learning RegEx, as they are called, is Google Analytics.

OR gets symbolized by the pipe symbol |. The pipe symbol is on my US keyboard just above the Enter key but for some reason looks like two vertical dashes on the key itself. It’s incredibly simple, and even Google Analytics does a fabulous job with its description:

This was a hard one to screw up, although they have done a good job of screwing up other easy Regular Expressions.

Here’s an example. One of the sites that I work with is an engineering education site, and they teach both Statics and Finite Element Analysis (FEA). Some people also refer to the latter as FEM (Finite Element Method). If I were only interested in the statics keyword searches, I would probably want to get references to FEA and Finite Element Analysis and FEM out of the search reports. I could do that easily in GA by going to the little filter box that can be found on each report, making it into a red minus sign (so that I am filtering out) and typing in FEA|Finite|FEM . This has the effect of saying, “Get rid of references to anything that includes FEA OR Finite OR FEM.”
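The exclude filter can be imitated in a few lines. This is a sketch only: Python's re module stands in for the GA filter box with the red minus sign (an assumption), and the keyword list is invented. The capitalization in the keywords follows the expression as typed, since how GA treats case is not covered here.

```python
import re

exclude = re.compile(r"FEA|Finite|FEM")

keywords = [
    "statics tutorial",
    "FEA course",
    "Finite Element Analysis basics",
    "FEM software",
    "statics and FEA",
    "free body diagram statics",
]

# Red minus sign: throw away anything that includes FEA OR Finite OR FEM.
remaining = [k for k in keywords if not exclude.search(k)]
print(remaining)
```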

Robbin
LunaMetrics