Cleaning up URLs in Google Analytics
We recently wrote about a visitor who asked how to enforce consistent capitalization in URLs in Google Analytics. It’s a pretty common problem.
In fact, URLs can show up inconsistently in Google Analytics in a variety of ways. The thing to recognize is that whatever URL appears in your browser is what Google Analytics records. Often, though, your webserver treats slightly different URLs as exactly the same page (differences in capitalization, a missing trailing slash, and so on). So if multiple versions of a URL are really the same page, we want to clean up those URLs in Google Analytics.
Default pages
Here’s one scenario: you look in your Top Content report and you see your home page in a couple of places, like this:
- /
- /index.php
Or something like that. (Remember the URLs we see here are only the part after the .com or .org or whatever. So one of these represents http://www.example.com/ and one represents http://www.example.com/index.php.) We know these are both really the home page, but seeing them as two separate URLs in this report isn’t very helpful. We have to add up the pageviews for both to see what the total number is.
Google Analytics gives us a simple way to fix this: the “Default page” setting in your Profile Settings.
Here, we can just put in “index.php” as the default page. Now Google Analytics will just add “index.php” to any URL that ends in a slash. Tada!
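To make the effect concrete, here is a minimal Python sketch of the behavior described above. It is only an illustration of the rule "append the default page to anything ending in a slash", not Google Analytics' actual implementation. The function name apply_default_page is made up for this example, and it assumes the URI is a plain path with no query string.

```python
def apply_default_page(uri: str, default_page: str = "index.php") -> str:
    """Append the default page to any URI that ends in a slash.

    Illustrative only: mimics the behavior described in the article,
    assuming the URI is a plain path with no query string.
    """
    if uri.endswith("/"):
        return uri + default_page
    return uri


# Both versions of the home page collapse into a single row.
print(apply_default_page("/"))           # /index.php
print(apply_default_page("/index.php"))  # /index.php
```

Note that the same rule applies to every URI ending in a slash, not just the home page, so "/careers/" would be recorded as "/careers/index.php" too.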
Multiple default pages
That doesn’t work for every scenario, however. Consider this:
- /
- /home.php
- /careers/
- /careers/index.php
Well, we can’t go and use the “Default page” setting from above, because now there are multiple possibilities, depending on where we are in the site.
Or, for that matter, what if we like the nice, clean, trailing slash URLs and want to get rid of all the index.php?
Well, this you can do with a Search and Replace filter. The setup looks like this:
Notice that I’m searching for anything ending in “/index.php”, using the pattern /index\.php$ (the dollar sign means “ends with” in Regular Expressions, and the backslash makes the dot match a literal period). I’m replacing that with just the slash. In the example above with both “index.php” and “home.php”, I could just create two filters, one for each. Once I’m done, for my data going forward, I just get the trailing-slash versions of the URLs.
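If you want to try the patterns before creating the filters, the same search-and-replace logic can be sketched in Python. This is a rough stand-in for two Search and Replace filters applied one after the other, not the filter engine itself. The names FILTERS and clean_uri are invented for this example.

```python
import re

# One (pattern, replacement) pair per Search and Replace filter,
# applied in order. The dot is escaped so it matches a literal period.
FILTERS = [
    (re.compile(r"/index\.php$"), "/"),
    (re.compile(r"/home\.php$"), "/"),
]


def clean_uri(uri: str) -> str:
    """Run a URI through each filter in turn and return the result."""
    for pattern, replacement in FILTERS:
        uri = pattern.sub(replacement, uri)
    return uri


for uri in ["/", "/home.php", "/careers/", "/careers/index.php"]:
    print(uri, "->", clean_uri(uri))
```

Because of the "$" anchor, only URIs that end in the default page are rewritten. A page like "/careers/jobs.php" is left alone.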
Trailing Slashes
Here’s one more thing that can be a problem, and this one is really challenging:
- /careers
- /careers/
- /careers/index.php
We’ve already solved the “index.php” problem. But notice we also have a problem with slashes. A lot of webservers automatically correct for this kind of thing with redirects. (Here’s some information on how to do it with Apache.) But if yours doesn’t, you can fix the data in Google Analytics (again, with filters).
This one is a little hard, because the patterns we want to match are kind of ambiguous. Here’s what I came up with, but chime in in the comments if you have a cleaner solution.
So, here’s the regular expression I used to match these URLs:
^(/[a-z0-9/_\-]*[^/])$
OK, so it starts with a slash (duh), then it contains 0 or more characters that are alphabetic (a-z), numeric (0-9), a slash, an underscore, or a hyphen. (You may have to adjust a little if you have other characters in your URLs.) Then it ends with a character that is NOT A SLASH. (That’s the important part.)
Why such a specific Regular Expression? Why not just:
[^/]$
That says “ends with any character that is not a slash”. Well, unfortunately that’s probably not specific enough. Because we might have pages like the following:
- /careers/jobs.php
- /careers/?search=web%20analyst
Notice that neither of those ends in a slash, but they’re not the kind of URLs we want to end in a slash, either. So we need a regular expression that doesn’t match a few key characters (like the dot and question mark) that clue us in that we don’t have just a directory name, but a full page in the URL.
So then the Advanced Filter just grabs the original part of the URL and appends a slash to it.
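Here is a small Python sketch you can use to check both regular expressions against the example URLs. It only approximates what the Advanced Filter does (capture the URI, then output it with a slash appended). The names add_trailing_slash, DIRECTORY_WITHOUT_SLASH, and NAIVE are made up for illustration. As written, the pattern only matches lowercase paths in Python.

```python
import re

# The specific pattern from the article: a path made only of letters,
# digits, slashes, underscores, and hyphens that does not end in a slash.
DIRECTORY_WITHOUT_SLASH = re.compile(r"^(/[a-z0-9/_\-]*[^/])$")

# The simpler pattern: anything that does not end in a slash.
NAIVE = re.compile(r"[^/]$")


def add_trailing_slash(uri: str) -> str:
    """Append a slash when the URI looks like a directory without one."""
    match = DIRECTORY_WITHOUT_SLASH.match(uri)
    if match:
        return match.group(1) + "/"
    return uri


examples = [
    "/careers",
    "/careers/",
    "/careers/jobs.php",
    "/careers/?search=web%20analyst",
]
for uri in examples:
    naive_hit = bool(NAIVE.search(uri))
    print(f"{uri!r}: naive match={naive_hit}, result={add_trailing_slash(uri)!r}")
```

The simpler pattern matches "/careers/jobs.php" and the search URL as well, which is exactly the problem. The specific pattern rewrites only "/careers".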
So that’s a variety of instances in which you have opportunities to clean up URLs and improve the data you have in Google Analytics.
About Jonathan Weber
Jonathan Weber is the Data Evangelist at LunaMetrics. He spreads the principles of analytics through our training seminars all over the East coast. Before he caught the analytics bug, he worked in information architecture. He holds a Master’s degree from the University of Pittsburgh School of Information Sciences. Jonathan’s breadth of knowledge – from statistics to analysis to library science – is somewhat overwhelming.








I think this sort of thing is easier to fix with Apache rewrite rules and redirects. It also helps to be careful not to link to bad versions of the URLs, so those versions do not get indexed in search engines.
There are always links that slip through so maybe both are needed?
Well, I was really sure Jonathan was going to show how to make sure all the URLs *didn’t* end in a slash… the problem with that idea would be that the home page is often just / and would need to have its own filter first, so that it could be handled as a separate case. But when I asked him why he didn’t do it the reverse way, i.e. no ending slashes and a special home page filter, he pointed out that if all the URLs ended without slashes, there would be another problem — the Content Drilldown reports wouldn’t be able to understand the directory structure well.
Hi,
Interesting post, and a common problem with websites that use both “/” and “/index.php”, or that just have pages missing the slash. The GA documentation on the use of the default page is very poor & misleading, so your blog post adds clarity.
Note: You forgot to mention a quick check for these pages is a search for – TopContentReport URI does not contain “/$”
I hope you don’t mind if I suggest some improvements to your filters:
— 1st filter to change “/index.php” to “/”
RequestURI ^/(index|default|home)\.(html?|php|aspx?|cfm)(.*)
Output RequestURI /$A3
— 2nd filter used to add the missing slash:
RequestURI ^/([/_a-z0-9\-]+[^/])$
Output RequestURI /$A1/
Or ^/([/_ a-z0-9\-\^,+@%=.*?$!|(){};~#:`'"]+[^/])$
Thanks
Phil.
Also, there are some typos…
Typo1: the dot should be escaped with a backslash:
Corrected:/index\.php$
Typo2:I’m searching for anything ending in “/index.html”
Corrected:I’m searching for anything ending in “/index.php”
Typo3:/careers/jobs.php
Corrected:/careers/index.php
Tristan, I totally agree that it’s better to fix this with rewrites to the URLs on the server instead. Unfortunately, some folks just don’t have the ability or access they need to make those kinds of changes, so cleaning up the data after the fact is their only option.
Phil, thanks for the typos, fixed now. (The “jobs.php” wasn’t a typo, however; I wanted to point out that we might have *many* page filenames in a directory, not just the default page.)
Your suggestions for regexes might be useful for lots of folks depending on their page names/extensions and the allowable characters in their URLs. I think [/_ a-z0-9\-\^,+@%=.*?$!|(){};~#:`'"] might be an overly broad character set for lots of examples, however. We don’t want to end up with a slash at the end of URLs like this, for example:
/somepage.php?query1=1234&query2=5678/
Characters like ? and . can be clues that we have an individual page name, not a directory.
Is this fixed when using the All in One SEO plugin when using WordPress? It uses rel=”canonical” to fix the urls? Am I right in saying that’s what it does?
This is also a good practice to keep in mind for Campaign tracking as parameters are case sensitive when recorded.
i.e., for medium, “email” and “Email” will show up as 2 separate lines in reports.
In the first screen shot above, Filter Field has options for all your campaign parameters.
You have to be real careful with the “Default Page” field in Analytics setting. Whatever value you put there, GA is going to add that value to the end of EVERY URL that ends with a slash. For example, if I put Default.aspx in there, and a visitor visits http://domain.name/ , analytics will create a non-existent URL and record a visit to it: http://domain.name/Default.aspx . I noticed GA doing this a few days ago, totally messing up my stats.
Best to just leave it blank in my opinion and deal with the duplication.
rel=”canonical” lets the search engines know which page is preferred. If you can use 301 redirects that would be better for page rank.
Why don’t you just remove the trailing slash instead of adding it…?
Search: /$
Replace:
no?
Hee hee, I liked your thinking, Mike. I am trying a similar solution to yours, but GA won’t let me save the filter. Does the replace have to have something in it, or can it be left blank? If it can’t be left blank, what regex would return a blank?
Thanks for posting this, not enough quality posts like this facing the nitty gritty of what analytics can do