Cleaning up URLs in Google Analytics

By Jonathan Weber

December 17, 2010

We wrote recently about a visitor who asked about enforcing capitalization consistency in URLs in Google Analytics. This is a pretty common thing.

In fact, URLs can show up inconsistently in Google Analytics in a variety of ways. The thing to recognize is that Google Analytics records whatever URL appears in your browser. Now, often your webserver treats slightly different URLs as exactly the same page (differences in capitalization, a missing trailing slash, and so on). So if multiple versions of a URL are really the same page, we want to clean up those URLs in Google Analytics.

Default pages

Here’s one scenario: you look in your Top Content report and you see your home page in a couple of places, like this:

  • /
  • /index.php

Or something like that. (Remember the URLs we see here are only the part after the .com or .org or whatever. So one of these represents http://www.example.com/ and one represents http://www.example.com/index.php.) We know these are both really the home page, but seeing them as two separate URLs in this report isn’t very helpful. We have to add up the pageviews for both to see what the total number is.

Google Analytics gives us a simple way to fix this. It’s the “Default page” field in your Profile Settings.

Here, we can just put in “index.php” as the default page. Now Google Analytics will just add “index.php” to any URL that ends in a slash. Tada!
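To make that concrete, here is a minimal Python sketch of the behavior just described. It is only a model of the idea, not how Google Analytics is implemented; the function name `apply_default_page` and the pageview counts are made up for illustration.

```python
def apply_default_page(request_uri: str, default_page: str = "index.php") -> str:
    """Model of the "Default page" setting: append the default page
    to any URI that ends in a slash, and leave everything else alone."""
    if request_uri.endswith("/"):
        return request_uri + default_page
    return request_uri


# Hypothetical pageview counts, as they might appear before the setting is applied.
pageviews = {"/": 120, "/index.php": 80, "/careers/": 15}

# Re-key every row by its cleaned-up URI and add up the counts.
merged = {}
for uri, count in pageviews.items():
    key = apply_default_page(uri)
    merged[key] = merged.get(key, 0) + count

print(merged)  # {'/index.php': 200, '/careers/index.php': 15}
```

Notice that “/careers/” picks up “index.php” too. The setting applies to every URL that ends in a slash, not just the home page.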

Multiple default pages

That doesn’t work for every scenario, however. Consider this:

  • /
  • /home.php
  • /careers/
  • /careers/index.php

Well, we can’t go and use the “Default page” setting from above, because now there are multiple possibilities, depending on where we are in the site.

Or, for that matter, what if we like the nice, clean, trailing slash URLs and want to get rid of all the index.php?

Well, this you can do with a Search and Replace filter. The setup looks like this:

Notice that I’m searching for anything ending in “/index.php” (the dollar sign means “end with” in Regular Expressions). I’m replacing that with just the slash. In the example above with both “index.php” and “home.php”, I could just create two filters, one for each one. Once I’m done, for my data going forward, I just get the trailing-slash versions of the URLs.
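If you want to try the patterns before you build the filters, a rough Python equivalent of the two Search and Replace filters looks like this. It is a sketch for testing only: the function name `strip_default_page` is made up, and the dots are escaped so they match a literal “.” in the filename.

```python
import re

# One pattern per filter: the default page at the very end of the URI.
DEFAULT_PAGE_PATTERNS = [r"/index\.php$", r"/home\.php$"]


def strip_default_page(request_uri: str) -> str:
    """Replace a trailing default page with just the slash."""
    for pattern in DEFAULT_PAGE_PATTERNS:
        request_uri = re.sub(pattern, "/", request_uri)
    return request_uri


for uri in ["/", "/home.php", "/careers/", "/careers/index.php", "/careers/jobs.php"]:
    print(uri, "->", strip_default_page(uri))
```

The first four URLs collapse to “/” and “/careers/”, while “/careers/jobs.php” is left alone. Because of the dollar sign, a URL like “/index.php?x=1” is not changed either.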

Trailing Slashes

Here’s one more thing that can be a problem, and this one is really challenging:

  • /careers
  • /careers/
  • /careers/index.php

We’ve already solved the “index.php” problem. But notice we also have a problem with slashes. A lot of webservers automatically correct for this kind of thing with redirects. (Here’s some information on how to do it with Apache.) But if yours doesn’t, you can fix the data in Google Analytics (again, with filters).

This one is a little hard, because the patterns we want to match are kind of ambiguous. Here’s what I came up with, but chime in on the comments if you have a cleaner solution.

So, here’s the regular expression I used to match these URLs:

^(/[a-z0-9/_\-]*[^/])$

OK, so it starts with a slash (duh), then it contains 0 or more characters that are alphabetic (a-z), numeric (0-9), a slash, an underscore, or a hyphen. (You may have to adjust a little if you have other characters in your URLs.) Then it ends with a character that is NOT A SLASH. (That’s the important part.)

Why such a specific Regular Expression? Why not just:

[^/]$

That says “ends with any character that is not a slash”. Well, unfortunately that’s probably not specific enough. Because we might have pages like the following:

  • /careers/jobs.php
  • /careers/?search=web%20analyst

Notice that neither of those ends in a slash, but they’re not the kind of URLs we want to end in a slash, either. So we need a regular expression that doesn’t match a few key characters (like the dot and question mark) that clue us in that we don’t have just a directory name, but a full page in the URL.

So then the Advanced Filter just grabs the original part of the URL and appends a slash to it.
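Here is a small Python sketch you can use to check the pattern against your own URLs before setting up the filter. It only mimics the idea (capture the URL, output it with a slash added); the function name `add_trailing_slash` is made up for illustration.

```python
import re

# The pattern from above: a path made only of directory-style characters
# whose last character is not a slash.
DIRECTORY_WITHOUT_SLASH = re.compile(r"^(/[a-z0-9/_\-]*[^/])$")


def add_trailing_slash(request_uri: str) -> str:
    """Append a slash when the URI looks like a directory without one."""
    match = DIRECTORY_WITHOUT_SLASH.match(request_uri)
    if match:
        return match.group(1) + "/"
    return request_uri


for uri in ["/careers", "/careers/", "/careers/jobs.php", "/careers/?search=web%20analyst"]:
    print(uri, "->", add_trailing_slash(uri))
```

Only “/careers” changes (to “/careers/”). One thing to watch when testing this way: Python matches case-sensitively by default, so “/Careers” would not match unless you add uppercase letters to the character class or pass `re.IGNORECASE`.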

So that’s a variety of instances in which you have opportunities to clean up URLs and improve the data you have in Google Analytics.

Jonathan Weber is our Data Evangelist, focusing on bringing the strategic value of data analysis to our customers. He spreads the principles of analytics through our training seminars and is currently writing a book on Google Analytics & Tag Manager. Before he caught the analytics bug, he worked in information architecture. Away from the computer you can find him as a flower farmer and plant geek.

  • Tristan Bailey

    I think this sort of thing is easier to fix in Apache rewrite rules and redirects. Also in the careful work practices of not linking to bad versions of the links so those versions do not get indexed in search engines.
    There are always links that slip through so maybe both are needed?

  • http://www.lunametrics.com Robbin

    Well, I was really sure Jonathan was going to show how to make sure all the URLs *didn’t* end in a slash… the problem with that idea would be that the home page is often just / and would need to have its own filter first, so that it could be handled as a separate case. But when I asked him why he didn’t do it the reverse way, i.e. no ending slashes and a special home page filter, he pointed out that if all the URLs ended without slashes, there would be another problem — the Content Drilldown reports wouldn’t be able to understand the directory structure well.

  • Phil

    Hi,

    Interesting post, and a common problem with websites that use both “/” and “/index.php”, or just missing slash pages. The GA documentation on the use of default page is very poor & misleading, so your blog post adds clarity.

    Note: You forgot to mention a quick check for these pages is a search for – TopContentReport URI does not contain “/$”

    I hope you don’t mind if I suggest some improvements to your filters:

    — 1st filter to change “/index.php” to “/”
    RequestURI ^/(index|default|home)\.(html?|php|aspx?|cfm)(.*)
    Output RequestURI /$A3

    — 2nd filter used to add the missing slash:
    RequestURI ^/([/_a-z0-9-]+[^/])$
    Output RequestURI /$A1/

    Or ^/([/_ a-z0-9-^,+@%=.*?$!|(){};~#:'"]+[^/])$

    Thanks

    Phil.

    Also, there are some typos…

    Typo1:the dot should be escaped:
    Corrected:/index\.php$

    Typo2:I’m searching for anything ending in “/index.html”
    Corrected:I’m searching for anything ending in “/index.php”

    Typo3:/careers/jobs.php
    Corrected:/careers/index.php

  • Jonathan

    Tristan — totally agree that it’s better to fix this in rewrites to the URLs on the server instead. Unfortunately, some folks just don’t have the ability or access they need to make those kind of changes, so cleaning up the data after the fact is their only option.

    Phil, thanks for the typos, fixed now. (The “jobs.php” wasn’t a typo, however; I wanted to point out that we might have *many* page filenames in a directory, not just the default page.)

    Your suggestions for regexes might be useful for lots of folks depending on their page names/extensions and the allowable characters in their URLs. I think [/_ a-z0-9-^,+@%=.*?$!|(){};~#:`'”] might be an overly broad character set for lots of examples, however. We don’t want to end up with a slash at the end of URLs like this, for example:
    /somepage.php?query1=1234&query2=5678/

    Characters like ? and . can be clues that we have an individual page name, not a directory.

  • Greg

    Is this fixed when using the All in One SEO plugin when using WordPress? It uses rel=”canonical” to fix the urls? Am I right in saying that’s what it does?

  • Tim Schneider

    This is also a good practice to keep in mind for Campaign tracking as parameters are case sensitive when recorded.

    ie: for medium, email and Email will show up as 2 separate lines in reports.

    In the first screen shot above, Filter Field has options for all your campaign parameters.

  • Eric

    You have to be real careful with the “Default Page” field in Analytics setting. Whatever value you put there, GA is going to add that value to the end of EVERY URL that ends with a slash. For example, if I put Default.aspx in there, and a visitor visits http://domain.name/ , analytics will create a non-existent URL and record a visit to it: http://domain.name/Default.aspx . I noticed GA doing this a few days ago, totally messing up my stats.

    Best to just leave it blank in my opinion and deal with the duplication.

  • http://truecleanjacksonville.com Chris

    rel=”canonical” lets the search engines know which page is preferred. If you can use 301 redirects that would be better for page rank.

  • Mike

    Why don’t you just remove the trailing slash instead of adding it…?

    Search: /$
    Replace:

    no?

  • http://www.musto.com Richard

    hee hee I liked your thinking Mike. I am trying a similar solution to yours, but GA won’t let me save the filter. Does the replace have to have something in it? Or can it be left blank? If it can’t be left blank, what reg-ex would return a blank?

  • http://www.ramsdensforcash.co.uk David

    Thanks for posting this, not enough quality posts like this facing the nitty gritty of what analytics can do :-)

  • Drew

    Hi and thanks for this encouraging post :)

    I notice that you had mentioned that it would be superior to do this on the server level… I could use a helping hand on this one matter :)

    Question (if going the server redirect route):

    How would you handle tracking parameters?

    E.g. http://www.domain.com/?ref=adwords instead of http://www.domain.com/index.php?ref=adwords

    Is this ok to do? Do you foresee any issues with this? I’m having difficulty finding a precedent of this behavior

  • http://www.lunametrics.com/ Jonathan Weber

    Drew — this is absolutely OK, and in fact I would recommend that you preserve the query string during any redirects you put in place.

  • Jeremy Campbell

    Correct me if I’m wrong here, but this would not add a slash before the query string of a URL that didn’t have a slash but did have a query string. IE:
    http://www.example.com/script?query=string
    would not become
    http://www.example.com/script/?query=string

    Is that right? Any solution for this? We have this problem a lot -and just as much with URLs with the query string on there as not.

    Thanks,
    Jeremy

  • http://www.lunametrics.com/ Jonathan Weber

    Jeremy — You are correct, this filter won’t add a slash before a ? signaling a query string. I’d use a second filter for that, with something like this:

    Field A: Request URI (.*)([^/])\?(.*)
    Output: Request URI $A1$A2/?$A3

    Test this out, but it should find any character that isn’t a slash [^/] before a question mark, and then add the slash in.

  • Valerie

    In the Google Analytics snippet for trailing slashes, you have the dash correctly escaped, but in the copy-able text in the paragraph space below that, the escape character is missing.

    ^(/[a-z0-9/_-]*[^/])$ instead of ^(/[a-z0-9/_\-]*[^/])$

    You might want to correct that. Great article though, very helpful!

    • http://www.uglyfashionmedia.com/ Eoin

      Nice one! Legend!

      Is this JavaScript?

  • Jackie Currie

    I’ve had a new issue crop up recently. A month or so ago, GA began reporting many (most? all?) of my posts in duplicate. It had never happened in the 3 years prior. Suddenly, it’s reporting a URL with a hyphen in it, AND the same URL with a vertical slash. Not sure if this was due to some change in the Yoast plugin or something else, but it’s driving me nuts. Here’s an example:
    Crazy Banana Cake with Cream Cheese Icing | Happy Hooligans
    Crazy Banana Cake with Cream Cheese Icing – Happy Hooligans

    I’d love to know how to correct this duplicate reporting.

    • http://www.prince-asfi.com/ Asfandyar Khan

      It is probably because of your Yoast plugin. You have to use this plugin very carefully. I was using the Yoast plugin before, but I have changed to All in One SEO Pack now; it is easy to use and works best for me. http://www.freakyincome.com/
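For anyone who wants to test the second filter suggested in the comments above (adding a slash before a query string), here is a rough Python sketch. It assumes the question mark in the pattern is escaped so that it matches a literal “?”; the function name `add_slash_before_query` is made up for illustration, and this is not a drop-in copy of the Google Analytics filter.

```python
import re

# Any character that is not a slash, directly before a literal question mark.
MISSING_SLASH_BEFORE_QUERY = re.compile(r"(.*)([^/])\?(.*)")


def add_slash_before_query(request_uri: str) -> str:
    """Insert a slash before the query string when one is missing."""
    match = MISSING_SLASH_BEFORE_QUERY.match(request_uri)
    if match:
        before, last_char, query = match.groups()
        return f"{before}{last_char}/?{query}"
    return request_uri


for uri in ["/script?query=string", "/script/?query=string", "/jobs.php?id=7"]:
    print(uri, "->", add_slash_before_query(uri))
```

The first URL becomes “/script/?query=string” and the second is left alone. The third becomes “/jobs.php/?id=7”, which is probably not what you want, so as with the trailing-slash filter you may need to narrow the pattern if you have page filenames followed by query strings.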
