Cleaning up URLs in Google Analytics

We wrote recently about a visitor who asked about enforcing capitalization consistency in URLs in Google Analytics. This kind of inconsistency is pretty common.

In fact, URLs can show up inconsistently in Google Analytics in a variety of ways. The thing to recognize is that Google Analytics records whatever URL appears in your browser. Often your webserver treats slightly different URLs as exactly the same page (differences in capitalization, a missing trailing slash, and so on). So if multiple versions of a URL are really the same page, we want to clean up those URLs in Google Analytics.

Default pages

Here’s one scenario: you look in your Top Content report and you see your home page in a couple of places, like this:

  • /
  • /index.php

Or something like that. (Remember the URLs we see here are only the part after the .com or .org or whatever. So one of these represents http://www.example.com/ and one represents http://www.example.com/index.php.) We know these are both really the home page, but seeing them as two separate URLs in this report isn’t very helpful. We have to add up the pageviews for both to see what the total number is.

Google Analytics gives us a simple way to fix this: the “Default page” field in your Profile Settings.

Here, we can just put in “index.php” as the default page. Now Google Analytics will just add “index.php” to any URL that ends in a slash. Tada!
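As a rough illustration of what that setting does to your data, here is a minimal Python sketch. It only mimics the behavior described above (append the default page to any URL ending in a slash); the function name `apply_default_page` is made up for this example and is not part of Google Analytics.

```python
def apply_default_page(request_uri: str, default_page: str = "index.php") -> str:
    """Mimic the "Default page" setting as described in the text:
    append the default page name to any URI that ends in a slash."""
    if request_uri.endswith("/"):
        return request_uri + default_page
    return request_uri


# Hypothetical sample URIs, taken from the examples in this post.
for uri in ["/", "/index.php", "/careers/", "/careers/jobs.php"]:
    print(uri, "->", apply_default_page(uri))
```

Note that every URI ending in a slash is rewritten, not just the home page, which is worth keeping in mind if different directories use different default file names.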

Multiple default pages

That doesn’t work for every scenario, however. Consider this:

  • /
  • /home.php
  • /careers/
  • /careers/index.php

Well, we can’t go and use the “Default page” setting from above, because now there are multiple possibilities, depending on where we are in the site.

Or, for that matter, what if we like the nice, clean, trailing slash URLs and want to get rid of all the index.php?

Well, this you can do with a Search and Replace filter.

Notice that I’m searching for anything ending in “/index.php”, using the pattern /index\.php$ (the dollar sign means “ends with” in regular expressions, and the backslash makes the dot match a literal dot). I’m replacing that with just the slash. In the example above with both “index.php” and “home.php”, I could just create two filters, one for each. Once I’m done, for my data going forward, I just get the trailing-slash versions of the URLs.
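To see what these two filters would do before touching a live profile, a small Python sketch can stand in for them. This assumes the filters behave like an ordinary regular-expression substitution on the request URI; the names `FILTERS` and `apply_search_and_replace` are invented for the example.

```python
import re

# Each filter is a (search pattern, replacement) pair, applied in order,
# mirroring one Search and Replace filter per default page name.
FILTERS = [
    (re.compile(r"/index\.php$"), "/"),
    (re.compile(r"/home\.php$"), "/"),
]


def apply_search_and_replace(request_uri: str) -> str:
    """Run the URI through every filter in sequence."""
    for pattern, replacement in FILTERS:
        request_uri = pattern.sub(replacement, request_uri)
    return request_uri


# Hypothetical sample URIs based on the examples in this post.
for uri in ["/", "/home.php", "/careers/", "/careers/index.php", "/careers/jobs.php"]:
    print(uri, "->", apply_search_and_replace(uri))
```

Because the patterns are anchored with the dollar sign, a URI such as /index.php?page=2 is left alone in this sketch: the file name is no longer at the very end.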

Trailing Slashes

Here’s one more thing that can be a problem, and this one is really challenging:

  • /careers
  • /careers/
  • /careers/index.php

We’ve already solved the “index.php” problem. But notice we also have a problem with slashes. A lot of webservers automatically correct for this kind of thing with redirects. (Here’s some information on how to do it with Apache.) But if yours doesn’t, you can fix the data in Google Analytics (again, with filters).

This one is a little hard, because the patterns we want to match are kind of ambiguous. Here’s what I came up with, but chime in in the comments if you have a cleaner solution.

So, here’s the regular expression I used to match these URLs:

^(/[a-z0-9/_\-]*[^/])$

OK, so it starts with a slash (duh), then it contains 0 or more characters that are alphabetic (a-z), numeric (0-9), a slash, an underscore, or a hyphen. (You may have to adjust a little if you have other characters in your URLs.) Then it ends with a character that is NOT A SLASH. (That’s the important part.)

Why such a specific Regular Expression? Why not just:

[^/]$

That says “ends with any character that is not a slash”. Well, unfortunately that’s probably not specific enough. Because we might have pages like the following:

  • /careers/jobs.php
  • /careers/?search=web%20analyst

Notice that neither of those ends in a slash, but they’re not the kind of URLs we want to end in a slash, either. So we need a regular expression that doesn’t match a few key characters (like the dot and question mark) that clue us in that we have a full page in the URL, not just a directory name.

So then the Advanced Filter just captures the matched URL (the part of the pattern in parentheses) and appends a slash to it.
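Here is a minimal Python sketch of that match-and-append step, useful for trying the pattern against your own URLs. It assumes the filter rewrites the request URI only when the whole pattern matches; the function name `add_trailing_slash` is hypothetical.

```python
import re

# The pattern from the post: a leading slash, then letters, digits,
# slashes, underscores, or hyphens, ending in a character that is not a slash.
DIRECTORY_WITHOUT_SLASH = re.compile(r"^(/[a-z0-9/_\-]*[^/])$")


def add_trailing_slash(request_uri: str) -> str:
    """Append a slash when the URI looks like a directory missing one."""
    match = DIRECTORY_WITHOUT_SLASH.match(request_uri)
    if match:
        return match.group(1) + "/"
    return request_uri


# Hypothetical sample URIs based on the examples in this post.
samples = [
    "/careers",
    "/careers/",
    "/careers/jobs.php",
    "/careers/?search=web%20analyst",
]
for uri in samples:
    print(uri, "->", add_trailing_slash(uri))
```

Only /careers changes in this run. The dot and the question mark are not in the character class, so the page and query-string URLs fall through untouched. Python’s re module is case-sensitive by default, so paths with uppercase letters would need a wider character class or the re.IGNORECASE flag in this sketch.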

So those are several ways you can clean up URLs and improve the data you have in Google Analytics.

Jonathan Weber

About Jonathan Weber

Jonathan Weber is the Data Evangelist at LunaMetrics. He spreads the principles of analytics through our training seminars all over the East coast. The next seminar he'll be leading will be a Google Analytics training in Boston. Before he caught the analytics bug, he worked in information architecture. He holds a Master’s degree from the University of Pittsburgh School of Information Sciences. Jonathan’s breadth of knowledge – from statistics to analysis to library science – is somewhat overwhelming.

http://www.lunametrics.com/blog/2010/12/17/cleaning-urls-google-analytics/

16 Responses to “Cleaning up URLs in Google Analytics”

Tristan Bailey says:

I think this sort of thing is easier to fix in Apache rewrite rules and redirects. Also in the careful work practices of not linking to bad versions of the links so those versions do not get indexed in search engines.
There are always links that slip through so maybe both are needed?

Robbin says:

Well, I was really sure Jonathan was going to show how to make sure all the URLs *didn’t* end in a slash… the problem with that idea would be that the home page is often just / and would need to have its own filter first, so that it could be handled as a separate case. But when I asked him why he didn’t do it the reverse way, i.e. no ending slashes and a special home page filter, he pointed out that if all the URLs ended without slashes, there would be another problem — the Content Drilldown reports wouldn’t be able to understand the directory structure well.

Phil says:

Hi,

Interesting post, and a common problem with websites that use both “/” and “/index.php”, or just have missing-slash pages. The GA documentation on the use of the default page is very poor & misleading, so your blog post adds clarity.

Note: You forgot to mention a quick check for these pages is a search for – TopContentReport URI does not contain “/$”

I hope you don’t mind if I suggest some improvements to your filters:

— 1st filter to change “/index.php” to “/”
RequestURI ^/(index|default|home)\.(html?|php|aspx?|cfm)(.*)  
Output RequestURI /$A3

— 2nd filter used to add the missing slash:
RequestURI ^/([/_a-z0-9\-]+[^/])$
Output RequestURI /$A1/

Or ^/([/_ a-z0-9\-\^,+@%=.*?$!|(){};~#:`'"]+[^/])$

Thanks

Phil.

Also, there are some typos…

Typo1: the dot should be escaped:
Corrected: /index\.php$

Typo2:I’m searching for anything ending in “/index.html”
Corrected:I’m searching for anything ending in “/index.php”

Typo3:/careers/jobs.php
Corrected:/careers/index.php

Jonathan Weber says:

Tristan, totally agree that it’s better to fix this in rewrites to the URLs on the server instead. Unfortunately, some folks just don’t have the ability or access they need to make those kinds of changes, so cleaning up the data after the fact is their only option.

Phil, thanks for the typos, fixed now. (The “jobs.php” wasn’t a typo, however; I wanted to point out that we might have *many* page filenames in a directory, not just the default page.)

Your suggestions for regexes might be useful for lots of folks depending on their page names/extensions and the allowable characters in their URLs. I think [/_ a-z0-9\-\^,+@%=.*?$!|(){};~#:`'"] might be an overly broad character set for lots of examples, however. We don’t want to end up with a slash at the end of URLs like this, for example:
/somepage.php?query1=1234&query2=5678/

Characters like ? and . can be clues that we have an individual page name, not a directory.

Greg says:

Is this fixed when using the All in One SEO plugin when using WordPress? It uses rel=”canonical” to fix the urls? Am I right in saying that’s what it does?

Tim Schneider says:

This is also a good practice to keep in mind for Campaign tracking as parameters are case sensitive when recorded.

ie: for medium, email and Email will show up as 2 separate lines in reports.

In the first screen shot above, Filter Field has options for all your campaign parameters.

Eric says:

You have to be real careful with the “Default Page” field in Analytics setting. Whatever value you put there, GA is going to add that value to the end of EVERY URL that ends with a slash. For example, if I put Default.aspx in there, and a visitor visits http://domain.name/ , analytics will create a non-existent URL and record a visit to it: http://domain.name/Default.aspx . I noticed GA doing this a few days ago, totally messing up my stats.

Best to just leave it blank in my opinion and deal with the duplication.

Chris says:

rel=”canonical” lets the search engines know which page is preferred. If you can use 301 redirects that would be better for page rank.

Mike says:

Why dont you just remove the trailing slash instead of adding it…?

Search: /$
Replace:

no?

Richard says:

hee hee I liked your thinking Mike. But I am trying a similar solution to yours but GA wont let me save the filter. Does the replace have to have something in it? Or can it be left blank. If it can’t be left blank what reg-ex would return a blank.

David says:

Thanks for posting this, not enough quality posts like this facing the nitty gritty of what analytics can do :-)

Drew says:

Hi and thanks for this encouraging post :)

I notice that you had mentioned that it would be superior to do this on the server level… I could use a helping hand on this one matter :)

Question (if going the server redirect route):

How would you handle tracking parameters?

E.g. http://www.domain.com/?ref=adwords instead of http://www.domain.com/index.php?ref=adwords

Is this ok to do? Do you foresee any issues with this? I’m having difficulty finding a precedent for this behavior.

Drew — this is absolutely OK, and in fact I would recommend that you preserve the query string during any redirects you put in place.

Jeremy Campbell says:

Correct me if I’m wrong here, but this would not add a slash before the query string of a URL that didn’t have a slash but did have a query string. IE:
http://www.example.com/script?query=string
would not become
http://www.example.com/script/?query=string

Is that right? Any solution for this? We have this problem a lot -and just as much with URLs with the query string on there as not.

Thanks,
Jeremy

Jeremy — You are correct, this filter won’t add a slash before a ? signaling a query string. I’d use a second filter for that, with something like this:

Field A: Request URI (.*)([^/])\?(.*)
Output: Request URI $A1$A2/?$A3

Test this out, but it should find any character that isn’t a slash [^/] before a question mark, and then add the slash in.
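One way to test it offline is a short Python sketch like the one below. It assumes the Advanced Filter builds its output from the three captured groups exactly as written above; the function name `add_slash_before_query` is invented for the example.

```python
import re

# Field A pattern from the reply above: anything, then one non-slash
# character, then a literal question mark, then the rest of the query string.
MISSING_SLASH_BEFORE_QUERY = re.compile(r"(.*)([^/])\?(.*)")


def add_slash_before_query(request_uri: str) -> str:
    """Rebuild the URI as $A1$A2/?$A3 when the pattern matches."""
    match = MISSING_SLASH_BEFORE_QUERY.search(request_uri)
    if match is None:
        return request_uri
    a1, a2, a3 = match.groups()
    return f"{a1}{a2}/?{a3}"


# Hypothetical sample URIs.
for uri in ["/script?query=string", "/script/?query=string", "/jobs.php?id=1"]:
    print(uri, "->", add_slash_before_query(uri))
```

In this sketch the first URI gains a slash and the second is left alone. The third also gains one (/jobs.php/?id=1), because the pattern does not distinguish a file name from a directory name, so it is worth checking against your own URLs before applying a filter like this.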
