
How to Filter Bot Traffic From Your Google Analytics

UPDATE: July 30, 2014 – Google announced a feature to automatically filter out bots and spiders. Learn more here.


Don’t let bad data crash your analytics party.

One of the benefits of client-side, tag-based analytics (as opposed to server side analytics) is that you generally don’t have to filter out traffic from bots.

However, it seems lately that some bots (*cough, cough* Microsoft) have been showing up in Google Analytics like an uninvited guest, crashing the data party.

For example, this is the graph of traffic to a site showing visits from a Bing bot:

Visits from Bing Bot

Overall visits to this site increased (artificially) 80.5% on August 14 (see the big spike to the right) – with about 90,000 visits from Bing’s crawler.

This is bad for (at least) two main reasons:

  • Bot visits skew your data, artificially inflating visits and unique visitors, increasing bounce rate, and decreasing pages/visit, average visit duration, goal conversion rate, ecommerce conversion rate, etc.
  • Bot visits increase the negative side effects of sampled data in Google Analytics. Even though the visits are from bots, they still count toward the visit totals when it comes to sampling.

Below, I will show you:

  • How to find out if you have this problem with bot traffic
  • How to get rid of the bot traffic from your reports

Uncovering Bot Traffic

You might be wondering if you have this problem in the first place. To find out if bots are crashing your data party, go to the Audience > Technology > Browser & OS report. The browser to look for is Mozilla Compatible Agent.

Mozilla Compatible Agent in the Browser & OS report

Now, just because the browser is Mozilla Compatible Agent doesn’t mean it’s a bot. There are other non-bots that use that user agent (some browsers in mobile apps, for example).

If you do have a problem with bot traffic, however, this is the canary in the coal mine.

If you see an unusually high number of visits from Mozilla Compatible Agent, you can go over to your Audience > Technology > Network report and apply this advanced segment (to show only visits where the Browser contains Mozilla Compatible Agent).

Look for visits from the following service providers:

  • microsoft corp
  • google inc.
  • yahoo! inc.
  • inktomi corporation
  • stumbleupon inc.

Also pay attention to the metrics – visits from bots will likely have close to 100% new visits, 100% bounce rate, 00:00:00 average visit duration, and 1 page/visit.

Kicking out these uninvited guests

No one likes a party crasher, and you’ll want to kick them out quickly. The easiest way to do this is to create and apply a filter to your view (profile) that excludes based on the ISP Organization (i.e. Service Provider).

Set your filter up as follows:

Google Analytics filter to exclude smart bots

For the Filter Pattern, use the following regular expression:

^(microsoft corp(oration)?|inktomi corporation|yahoo! inc\.|google inc\.|stumbleupon inc\.)$|gomez
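If you’d like to sanity-check the pattern before saving the filter, here is a minimal sketch you can paste into your browser’s JavaScript console (the sample ISP names below are just illustrations):

var botIspPattern = /^(microsoft corp(oration)?|inktomi corporation|yahoo! inc\.|google inc\.|stumbleupon inc\.)$|gomez/;

// GA reports Service Provider values in lowercase, so test lowercase strings.
['microsoft corp', 'microsoft corporation', 'comcast cable', 'gomez'].forEach(function (isp) {
  console.log(isp + ' -> ' + (botIspPattern.test(isp) ? 'excluded' : 'kept'));
});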

This will take care of the main offenders. Of course, if you noticed other service providers in your data that look like bots that aren’t included in the filter above, be sure to include them!

Unfortunately, filters only apply to your data moving forward (not to historical data). So to remove these bot visits from your historical data, you’ll need to create an advanced segment (or just copy this one).

Making sure they never get through the front door

Unfortunately, even though you can filter these bots out of your data, they still count toward the total number of “visits” to your site from GA’s perspective. To put it another way, these bot visits can cause your data to be heavily sampled, even though you’re filtering them out.

Sampling happens at the web property level whenever there are more than 250,000* visits for the selected date range and you request data that is not pre-calculated (when you apply an advanced segment, secondary dimension, custom report, etc.).

* This sample size can be adjusted to a maximum of 500,000 visits

We’ve covered the problems with sampling (and how to work around it) before.

To get rid of this unwanted guest once and for all requires a more sophisticated solution, which involves modifying your Google Analytics tracking code.

To give a high-level view: you would need to wrap your tracking code in a function that checks whether the “visitor” is human or a bot; if they are human, execute the Google Analytics tracking code, else skip the tracking code altogether. To keep with the analogy, this would be like having a bouncer at the front door, only letting real visitors past the velvet rope and telling the bots to “take a hike!”
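As a rough illustration only, a naive version of that bouncer might look like the sketch below. The user-agent test here is just a placeholder – the smart bots discussed in this post present a normal browser user agent, which is exactly why a complete solution is trickier:

// Only load and run GA if the user agent doesn't look like a known bot.
// The pattern below is a naive stand-in; smart bots often spoof real browsers.
var looksLikeBot = /bot|crawler|spider|slurp|bingpreview/i.test(window.navigator.userAgent);

if (!looksLikeBot) {
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-XXXXXX-Y', 'example.com');
  ga('send', 'pageview');
}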

Are you interested?

If you’re interested in a specific solution for doing the above, let me know in the comments. If there’s enough interest, I’ll follow up with the code to do it.

 


About Jim Gianoglio

Jim Gianoglio is our Digital Analytics Engineer. He works with implementation, analysis of Google Analytics, and spearheads the LunaMetrics Google Analytics seminars across the country. Want to see him in action? He'll be heading our Google Analytics training in Los Angeles. Before succumbing to the siren song of analytics, he led the SEO campaigns of Fortune 500 companies in the insurance, retail and CPG industries. Things you didn’t know about Jim: he runs marathons, photographs weddings and has done voiceovers for TV commercials.

http://www.lunametrics.com/blog/2013/09/05/filter-bots-google-analytics/

54 Responses to “How to Filter Bot Traffic From Your Google Analytics”

Unfortunately, this only addresses one area of the bot problem. Rogue crawlers from SEOs doing link research can be a much larger issue for some sites and their tools are becoming more sophisticated. Still, this is a very nice, timely tutorial. Kudos to you.

Michael, what behaviour would these SEO crawlers trigger that’s different to the likes of Slurp? Any examples?

Jim, thanks for writing this. Here are some great contributions to the topic from Dan Barker and Phil Pearce http://p.barker.dj/autobots

@Michael – are you talking about crawlers like Screaming Frog, OutWit, etc.? I haven’t seen them show up in GA. Certainly, crawlers (whether it’s for SEO purposes or other) present other issues (consuming server resources and bandwidth, etc.).

@Carmen – thanks for the link – an excellent idea to set up an alert for bot traffic!

I also noticed something interesting the other week. For some clients I read the _utma cookie and use the random visitor ID to set a visitor-level custom variable. So everyone should have this set as it’s based on the GA cookie.
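For reference, here’s a simplified ga.js-era sketch of that setup (the slot index and variable name are placeholders):

var _gaq = _gaq || [];

function getUtmaVisitorId() {
  // The classic GA cookie is actually named __utma; its second
  // dot-separated field is the random visitor ID.
  var match = document.cookie.match(/(?:^|;\s*)__utma=([^;]+)/);
  return match ? match[1].split('.')[1] : null;
}

var visitorId = getUtmaVisitorId();
if (visitorId) {
  _gaq.push(['_setCustomVar', 1, 'visitorId', visitorId, 1]); // slot 1, visitor scope
}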

I then noticed a sizeable segment of visitors did not have this custom variable set. I created a segment for it ( custom variable does not match [a-z0-9]) and that’s how I spotted microsoft corp and a couple of other bots. If the GA cookie is set then there’s no reason why this custom var should not be set as well, correct?

I thought this might hold some clues as to how to identify them but I’ve yet to test it. Have you seen anything similar?

That’s very interesting. I haven’t seen that specifically, but it makes me wonder why it’s not setting the custom var.

Thanks for adding to our understanding of this!

yisrael says:

Hi Jim,
Thank you for this great tutorial.
I think to have the best of both worlds, you can create a function to exclude your standard GA tracking from running and also run a different one for a different GA account. This way your true visitor data is not skewed, and you can data mine which bots are coming, any effect that might have on the performance of your site, when Google has scraped your site, etc.
A nice addition: if you use the Chrome extension Chartelligence, they have an API and you can send events to a URL endpoint, so you can overlay bot visits onto the time chart in GA.
What do you think?

@yisrael – I like the way you think!

I hadn’t heard of Chartelligence before – but it looks promising. Kind of like annotations on steroids.

The only thing with tracking bots this way is that you still won’t be tracking all (or even many) bots. Most bots never show up in GA – it’s only a handful that do. But still interesting data, nonetheless.

Nate Shivar says:

Hi Jim – thanks for this post. I’ve had several clients who have had problems with Microsoft related bot traffic.

I’ve been using Advanced Segments to filter them out, but would love to see a js tag solution.

Any idea why this problem would show up on some Analytics profiles and not others that are set up the same way? I’d assume that the same BingBot is crawling the same sites. Are there certain conditions that you’ve found to make it fire the code and not others?

@Nate – I haven’t been able to discover the cause, just the symptom. I’m not sure why some sites see the spikes in traffic from this bot and others don’t – but it’s generally more noticeable in large sites.

Looking at a couple dozen sites ranging from small (~10,000 visits/month) to large (~75 million visits/month) they all have some traffic from this bot. But it’s not consistent. Some sites see spikes on specific dates, others see continuous low levels of visits. And some sites see a gradual but steady increase over the past several months.

Paul Dybala says:

Thanks Jim – this is perfect. I have seen this happen 3 times on several different sites we manage.

On the option to have the GA code served based on the user agent, would you have any concerns that Google would see that as cloaking?

Jodie says:

I have a really small niche website and have searched high and low for this answer. I noticed the problem about 4-5 months ago and have used filters to exclude data, which worked for about 2 months. But the problem is back big time. It’s an MSN bot causing the problems, and I think a Google bot has gotten into the act too. HELP! Seriously…

I could add one to the list – rackspace cloud servers, which is showing a network domain of cloud-ips.com. Major amounts of bot traffic over the last month.

A related one also showed up which is clearly a bot, but an odd one. It’s also showing a network domain of cloud-ips.com, but a service provider of rackspace hosting. Bounce rate is 50%, 100% new visits, 2.99 pages/visit, and only came to the site for exactly 1 month, 48 times per day, before disappearing.

So rackspace cloud servers|rackspace hosting.

Thanks for uncovering this. I started noticing this issue on one of our dealer group’s websites. At first I thought it was tied to a third-party e-mail campaign we started at about the same time, but then I found this article and confirmed that what GA was telling me matched your info.

I would love to be able to give our website provider specific instructions on how to filter this with a script, as the UI for the end user is restricted and “dumb” (I cannot add my own code).

Farah says:

Thumbs up for such an informative post, Jim! I noticed this issue on my client’s website too. Have you tried using Chartelligence?

Thanks for the information. I can confirm seeing this on some websites. Another piece of evidence to look for is the “Technology > Network Domain” dimension. All from “msn.com”.

Microsoft is doing a terrible thing here.

Mitul says:

Thanks for the info, Jim. Out of nowhere I saw a spike in traffic on my site from Bellingham, WA, and further investigation confirmed it’s from msn.com. I have just added the filter; however, the traffic from Bellingham still shows in real-time reporting. Is there any way to get rid of this? Thanks!

Excluding data based on ISP Org does not stop the bots, and many bots are now frequently changing their IP addresses and using consumer ISPs like Comcast or service providers like Amazon EC2… removing them from your data also does not stop form spam, fraud, or content theft. WAFs and rate limits will not suffice either.

The best way to stop these problems is to manage bots in-line and in real time so that they never reach your web servers… check out Distil Networks for one good option.

Robin says:

Hi Jim, this has been a great help. We sometimes see 33% of visits coming from Direct, which seems to be caused by these microsoft corp bots. In one month I had over 840,000 visits from this service provider, all showing 100% new visits, 100% bounce, zero avg. time on page and of course zero transactions. Straight away I thought about filtering these visits out; we have several ‘views’, so I could place the filter on a spare view to start with. But when I drilled the data down with the secondary dimension medium, I noticed that over 730,000 were direct – certainly bots. But then around 111,000 were coming from a medium of ppc, which is one of our paid channels. Still, the stats show 100% new visits, 100% bounce, zero avg. visit duration and zero transactions. It is more than likely still bots, but I am concerned about filtering out data that could possibly be coming from a paid channel.

Can you please shed some light on this? Is it still ok to filter these visits out? or is there a way to filter our visits from this IP Organization but only visits that are (direct)?

Many thanks in advance.
Cheers
Robin

Rob Flaherty says:

Jim – Great post. This seems to be one of the few pieces of writing on this and I think it’s helping a lot of people.

For anyone interested in additional techniques to identify bot traffic, I wrote something about detecting bots based on behavior instead of ISP/Browser.

larric says:

Thanks for the great info. I’m still new to figuring out how my site is getting views. I thought it was working, but then I wondered why it was not converting well. And then I wondered: maybe it’s bots. I’m going to apply the settings and see what changes. Thanks again.

Roddy says:

MSN smartbots have recently started showing up in my stats again – the ISP organisation field changed from microsoft corp to microsoft corporation, which the above regex doesn’t catch.

@Roddy –

I noticed this the other day too. I just updated the Regex above in the post, and here it is too, so you don’t have to scroll all the way back up :)

^(microsoft corp(oration)?|inktomi corporation|yahoo! inc\.|google inc\.|stumbleupon inc\.)$|gomez

James Jensen says:

This is not some kind of new issue with regard to the Web. In 1998, the log files from a Web server I managed were reporting bots and other traffic that was not “human”. Filters will not get all the bots either. This will be an ongoing issue for those who care until some legal eagle figures they can make a killing with some type of litigation, be it consumer fraud or some kind of class action.

larric says:

Thanks for the tips and easy instructions. I’m gonna try it and hopefully it works.

Melanie says:

Hi Jim! Thank you for this article. I’d be interested in the GA code if you have time to post it. I’ll give the rest of your suggestions a shot in the meantime. Thanks again!

Kathy says:

Great post Jim! Just recently started seeing bot traffic and wasn’t sure what it was. Thanks for the tip, very helpful!

Finn says:

When I try to make a filter as suggested, I get an error message from GA: “One or more fields contains invalid data. Please fix and submit again.”

@Finn – there might be an issue with copying from this post and pasting directly into the filter pattern field. Maybe formatting of the text is causing some errors?

Try copying the filter pattern from the post:

^(microsoft corp(oration)?|inktomi corporation|yahoo! inc\.|google inc\.|stumbleupon inc\.)$|gomez

and pasting it into a text editor (like Notepad). Then copy the text from the text editor and into the filter pattern field.

Let me know if that doesn’t work.

Jente says:

Hi Jim,

Thanks for the clear explanation. While looking around some of our websites for bot behaviour I ran into some questions. Hope you can help me out here…

1) Isn’t it dangerous to only filter on ISP Organization? I get the impression that by filtering the ISP organizations that show bot behaviour, I also exclude visits from those ISPs that are human. Microsoft Corporation, for instance, also shows human-like behaviour (visits from the Mozilla browser with a 50% bounce, multiple pageviews and an average time on site of 20 seconds).

2) I also noticed a lot of bot-like behavior on other browsers (Internet Explorer, for instance). In fact, the biggest bot-like traffic seems to be coming from (not set). Can I assume that these are bots too? Or are you sure that bots always use the Mozilla Compatible Agent?

Thanks in advance!

@Jente -

Yes, there is a possibility that you will have some false positives, if you have real people visiting your site and their ISP organization is Microsoft Corp. In my experience, this is an incredibly small percentage of overall traffic, and worth sacrificing to get rid of the bot traffic.

However, if you get a lot of “human” traffic from Microsoft, and that’s an important audience segment to you, then you’ll want to make sure you have a View (profile) of the data that includes them.

The bottom line is that bot behavior is not always consistent (browser may not always be Mozilla Compatible Agent, may not always be Microsoft Corp as ISP Org., etc) but you can look for clues that help you filter them out. Another clue is whether or not Java is enabled in the browser (usually, bots will not have it enabled). Again – not always a smoking gun, but another indicator.

Sara says:

Hey Phil! I’m pretty sure I’m having a bot problem, but it’s nothing like you described in your article! My bots don’t show their service provider; it just states “not set”, same with city, etc. They are from Chrome and Firefox. But they always show up when I share a link on Facebook/Twitter, always appear as new visitors and always show the country as USA (I’m from Portugal). A second after I share a link to my blog they’re already live at my website. It’s pretty difficult to keep track of my real visits and stats with these bots always crawling. Do you know how I can prevent them from it? Thanks :)

@Sara -

Are you sure these are bots and not just people clicking on the links you share? What do the other metrics look like? If it’s 100% bounce rate, 100% new visits, 1 page/visit, 00:00:00 avg. visit duration, then that would indicate that it is likely a bot.

If it is bot activity and doesn’t have a service provider, it becomes more difficult to filter them out. It may not even be possible in GA. You basically need to have something that all of these visits have in common that is different from “real” visits to your site.

Just be careful not to filter out visits with a service provider of “not set” – that will exclude real visits too.

Jim Seward says:

I’m having trouble with the copied-and-pasted regex as well. I get “one or more fields contains invalid data.”

Tried cleaning it through Notepad as well.

@Jim -

I’ve tried creating this filter on several other accounts and am not getting the error. I tried copying straight from the post and it worked.

Instead of copying/pasting, will it work if you type it in directly?

Robin says:

thanks for sharing this wisdom. very interested in modifying the Google Analtyics tracking code.

Kurt says:

To those having trouble pasting the regex — This happened to me for a different filter. It’s a “bug” in the GA filter setup page; the page is doing validation on hidden fields that it created. Click on the “Predefined Filter” button, and you will probably see the error message. Clear those up and then return to the “Custom Filter.” That should clear things up.

And, thanks, Jim for this article! It’s a great help.

Thanks for the tip Kurt!

Drew says:

Hi, would also be interested in seeing an example of how to implement the bot “bouncer” strategy :)

Sameh says:

Here are more bots (by Service Provider):

- domaintools llc (DomainTools.com bot)
- lezon inc (the company who runs Domain Research Tool, Estibot, Dropping.com, and many other domain tools)
- automattic inc (WordPress)

BTW, one of the easiest ways to block a wide range of bots is to block (Language: c), which is the system language of the majority of bots.
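For example, a custom Exclude filter on the language field (this appears as a “Language settings” filter field in GA’s filter setup – double-check the name in your account) could use the pattern below; as always, test it on a spare view first:

^c$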

Also, one of the easiest ways to spot more bots is the Service Provider (Technology > Network) report. Just filter any Service Provider with a bounce rate higher than 99% and a reasonable number of visits (compared to your total visits) and do your investigation. You will be amazed by the amount of bot traffic in your reports. I had no idea how bad it was until recently.

Thanks Sameh for adding to the bot list.

As always, I urge you all to test out the bot list regex in your reports first to make sure you’re not filtering out real traffic too.

For example, I have a client for whom I can’t include “google inc.” in the filter because they get visits (and purchases) from people who have Google Inc. as their ISP (i.e. Google employees shopping while they work).

Lucas says:

Hi Jim!

My site has had many visits with 0 pageviews, 0 visit duration and 0 bounce rate, all coming from cities in other countries and direct traffic. I suspect they are crawlers.

Looking at Audience > Technology > Browser, these undesirable visitors came via Firefox.

Do you know how I could solve this problem?

Thanks!

Olaf Krüger says:

Hey Jim.

What about the solution you said you would follow up with if enough people were interested?

Best
Olaf

@Lucas -

If you read the blog post, you’ll find that it provides the necessary information and instructions to solve your problem.

@Olaf -

I have some thoughts on a more in-depth solution to filter out the bots completely, but it’s a tricky subject. I will certainly share what I have when it’s ready – thanks for the interest!

Jackie Chu says:

Amazing blog post and so helpful- thank you

Susan says:

Great info. Thanks.
Browser says None, browser version says Not Set. All the data is 0 bounce rate, 0 page views, 0 avg. time on site. And yet, it is consistently coming from 6 different U.S. cities.

Thank you.
Susan

What is the best way to block this?

Wendy says:

Great information on here, thanks.
I installed Google Analytics 2 days ago.
In 2 days I have 24 visits, which I have narrowed down to just 5 real visits (and 2 of them were me testing the website). I suspect this is just the start of the bots. Half the visits were from digital ocean inc. All 100% bounce, 1 page, 0s.
Is there any way just to filter out all the 100% bounce, 1 page, 0s visits no matter where they come from?

@Susan -
You can filter on the browser or browser version (i.e. exclude traffic where the browser is “none”), but that would likely cause some false positives. Just check your data first to make sure there is no traffic where the browser is “none” that looks like it may be real.

@Wendy -
Digital Ocean is a cloud hosting service for developers, so it may be that someone is using them to host an app that is crawling your site. I’d be careful about filtering them out with such a small sample size, however. One thing you may want to do (if you haven’t already) is set up a new View in your Web Property that has no filters on it at all. Then you can add a filter to your main View to filter out visits with service provider containing “digital ocean”.

jure says:

“So to remove these bot visits from your historical data, you’ll need to create an advanced segment (or just copy this one).”

Have to say that doesn’t seem to work. The data still stays there when checking history.

Any other solution to it?

Tnx!

@jure -

Once you click on the link (“copy this one”), you have to select a View from your Google Analytics account to import the advanced segment configuration. Then, you have to go into the reports in that View and apply the advanced segment. It’s not an “always on” kind of thing, like filters.

Andrew says:

Why would a bot hit one page on my site thousands of times a day? What purpose could this possibly serve?

Hi Jim,

Thanks for this great info. I’m having a problem that every time I submit new ads for approval with Facebook or Twitter, their bot visits my site as part of their ad approval process. Do you know how to exclude these bots? I haven’t been able to find a unique pattern to identify them in GA.

These visits are really screwing up my data.

Thanks!

@Aaron -

If there’s no pattern to the Service Providers, you could look at the user agent of the suspect visits.

To do this, you would first have to capture the user agent string (not something that is captured by default in GA). You could do this by modifying your basic tracking code as follows:

(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');

// Capture the raw user agent string and send it with the pageview
// as the value of a custom dimension (replace XX with your dimension's index).
var userAgent = window.navigator.userAgent;
ga('create', 'UA-XXXXXX-Y', 'example.com');
ga('send', 'pageview', {
  'dimensionXX': userAgent
});

Don’t forget to configure your custom dimension first (https://developers.google.com/analytics/devguides/platform/customdimsmets#configuration). In the above code, the Xs in ‘dimensionXX’ need to be replaced with the index of the custom dimension you created for this (1-20).

If the user agents of these Facebook and Twitter bots identify themselves in some way, you could create a filter that excludes based on that custom dimension value. For example, here’s a user agent I found creeping around in some reports:

Mozilla/5.0 (en-us) AppleWebKit/537.36 (KHTML, like Gecko; Google-Adwords-DisplayAds-WebRender;) Chrome/27.0.1453 Safari/537.36

(By the way, the above user agent had 14 visits, 100% new visits, 100% bounce rate, Browser = (not set) and Service Provider = google inc.)
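A corresponding exclude filter on that custom dimension might then use a pattern along these lines (the tokens here are purely illustrative – build yours from the user agents you actually capture):

bot|crawler|spider|preview|WebRender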

Hopefully that helps!

melodic says:

In my May 2014 stats on my site (not Google stats – stats from my server provider), it says that with 1 visit I had 500 thousand hits. I noticed it right away on my graph because it stuck out like a sore thumb. It gave the IP address it was from and I traced it; it said it was from Kaiser Permanente in Maryland. Could this have been a bot that got onto someone’s computer at one of Kaiser’s offices and then ran on their computer?

@melodic – that’s possible. At least with the IP address you can now filter out any visits that happen moving forward.