Understanding Bot and Spider Filtering from Google Analytics

By Sayf Sharif

August 7, 2014

On July 30th, 2014, Google Analytics announced a new feature to automatically exclude bots and spiders from your data. At the view level of the Admin area, you now have the option to check a box labeled “Exclude traffic from known bots and spiders”.

Most of the posts I’ve read on the topic simply mirror the announcement, without really talking about why you’d want to check the box. Maybe a more interesting question is why you would NOT want to. Still, most people will ultimately want to check this box. I’ll tell you why, but also how to test it beforehand.

The Spider-Man Transformer is neither a bot nor a spider, nor a valid representation of my childhood.

What are Bots and Spiders?

The first thing to understand is just what a bot or spider is: an automated computer program, not a person, hitting your website. They do this for various reasons.

Sometimes it’s a search engine looking to list your content on its site. Sometimes it’s a program checking whether your blog has new content so it can let someone know in their news reader. Sometimes it’s a service that YOU have hired to make sure that your server is up, that its loading speed is normal, and so on.

Some of the more basic bots don’t run code on your site, like the JavaScript that Google Analytics requires, so you don’t see them in your traffic reports. You will, however, see them in your server logs if you have access to those. Many web hosts charge by the hit on the server, based on those logs.
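
For reference, this is roughly the tracking code in question, the standard analytics.js snippet of that era (UA-XXXXX-Y is a placeholder property ID). A basic bot that never executes JavaScript never runs this, so it never sends a hit to Google Analytics; a “smart bot” that does execute it shows up in your reports like any other visitor:

  <script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-XXXXX-Y', 'auto');  // placeholder property ID
  ga('send', 'pageview');              // the hit that smart bots also trigger
  </script>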

Some sites, like LunaMetrics, get tons of these hits that we only see at the server log level. All you people whose automated services ping our servers looking for new posts every few seconds are essentially costing us money. It’s ok, we don’t mind. It doesn’t screw up our analytics data.

The Problem With Smart Bots

A stereotypical bot spike.

The problems start when you learn that some of these automated programs CAN run the Google Analytics code and WILL show up as hits in Google Analytics. Sometimes a site will barely get touched by these “smart bots” and you won’t give them a second thought, since they won’t visit often enough to skew your insights. Other times you’ll get wild, insane spikes in your data, which you’ll have to deal with.

Dealing with these bots can be a big problem. In the past, you often had to hunt them down and identify them by the browser they report, the number of pages they hit, and other behavior. Once you figured this out, you could filter many of them out going forward, but their traffic would still remain in your historical data and affect sampling.

This is not a human generated traffic pattern.

Because this traffic remains in your historical data, you’ll be forced to use segments to get rid of it when you look at your property, which will often cause sampling for large sites. Even worse, your total session count is inflated by these bots, so even after you filter them out you’ll trigger sampling sooner in the interface, and reports will start from a smaller, less accurate sample size right out of the gate.

Even when you know about the bots AND deal with them, they can still make your life miserable if you’re an analyst.

Goals and Conversions can be affected by a spike in segmented traffic.

The problems don’t even stop there, and it’s not just about traffic. Some bots can even log into your site and pretend to be a specific audience segment. Many of these belong to services YOU pay for, like Webmetrics.

If you don’t filter out these “super smart bots”, they’ll really mess up your data: you might be looking at a specific audience segment and see a wild swing in traffic or Ecommerce conversion rate. Or worse, the bots ramp up slowly and you never get a clear indication that something odd happened.

The New Bot and Spider Filtering Feature

Which brings us back to Google Analytics’ new offering. This feature automatically filters from your data all spiders and bots on the IAB/ABC International Spiders & Bots List, a list that is compiled and continuously updated as new bots and spiders are identified. Membership to see this list generally costs $4,000 to $14,000 a year, but by checking the little box on your view (and on every view where you want them filtered) you get to utilize the list for free. Some bots and spiders may slip through, but usually not for long, and hopefully not long enough to affect your data.

You won’t be able to SEE the actual list they’re using, but you can exclude visits from bots on their list from showing up in your Analytics.

The “Exclude traffic from known bots and spiders” checkbox, shown in the View Settings within the Admin panel.

So great, right? Check that box? Not so fast.

Best Practices For Implementing New Filters

I am all for checking the box, but this is a great chance to talk about and implement best practices:

Step 1: Make sure you have an unfiltered view in your property: zero filters, and leave this box unchecked there. That way, if there IS some sort of error, you’ll have your data in a pure state to go back to and compare against.

Step 2: Don’t implement it immediately in your main view. We’ve heard reports of people having problems, even of their Ecommerce data being affected. I don’t know how accurate any of these complaints are, but it’s always good practice to put a new filter in a test view first. Create a new test view that mirrors your main one in every other respect, and then check the box.

Let it run for a week or two, and see what sort of difference you have. Investigate major differences and clarify internally what monitoring systems you’re paying for.

Step 3: If you’re happy with the new bot and spider exclusion filter based on this test, then go ahead and implement it in the main view.

I don’t know if this is going to solve all our smart bot and spider problems, but it’s a great start. As someone who recently had to manually exclude hundreds of different IP addresses from a view for a client, I can attest to a single checkbox being a humongous time saver.

So follow the best practices, and hopefully enjoy your cleaner data.

Sayf is a former LunaMetrician and contributor to our blog.

  • http://www.verticalnerve.com Tyson

    This is a great feature add, but what’s unknown to me is how many of the bot spikes this would have prevented in the past. We have clients that routinely get hit with international bots, automated monitoring bots or “attack bots” (like the recent spike affecting AdRoll customers), and it’s hard to tell if the new bot filtering would have prevented these visits.

    Sometimes it feels like spinning plates because as you mentioned, as soon as you do have a bot attack, the data is irrevocably corrupted. I really wish Google would give us an option to create a visitor segment and then permanently purge those sessions from the database (with admin permissions and ample warnings, of course). This has been a problem for too long, and too many analysts are already having to use segments just to get clean data, which is frustrating.

  • Sayf Sharif

    I don’t know either. It’s a big list, so I assume it would remove at least some. We’ve posted in the past about automatically filtering out certain known bot hostnames, but that doesn’t always help prevent all of them.

    I agree that a permanent purge of a segment would be a great solution, but it’s one I doubt Google will ever pursue.

  • http://www.mobiliodevelopment.com/ Peter Nikolow

    There are a few kinds of bots/crawlers. Most of them don’t run JavaScript, but some of them do.

    This checkbox applies only to bots that can execute JavaScript.

    The major issue is that you don’t know which bots will be excluded by that setting. As a workaround, you can write some JS code in your HTML to skip Analytics initialization/execution for certain user agents.
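
    For illustration, a minimal sketch of that workaround, wrapped around the standard analytics.js snippet (the user-agent patterns and the UA-XXXXX-Y property ID are placeholder examples, not a vetted bot list):

    <script>
    // Only bootstrap Google Analytics for user agents that don't match
    // a rough list of bot-like patterns (illustrative only).
    var botPattern = /bot|crawler|spider|phantomjs|headless/i;

    if (!botPattern.test(navigator.userAgent)) {
      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
      })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

      ga('create', 'UA-XXXXX-Y', 'auto'); // placeholder property ID
      ga('send', 'pageview');
    }
    </script>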

  • Sayf Sharif

    Peter, you’re correct. As I mention in the post, many bots don’t run JavaScript, and we don’t really care about those since they’re never going to trigger Google Analytics.

    We have coded different ways to filter bots, and using the user agent can be a good way to segment out some data, but it’s far from perfect, and a number of the bots are getting smart enough not to use generic user agents. There are ones that look for all intents and purposes like they are running Internet Explorer, and the only way you can suss them out as non-human is through behavior.

    Even then, filtering out “Internet Explorer 8 visitors from Denver, Omaha, and London” is a kludgey, brute-force way to do it, and it almost certainly WILL also eliminate SOME actual humans.

    There’s no perfect solution, but this checkbox is a great way of eliminating known bots, rather than maintaining, in some cases, a hundred or more filters excluding various IPs from a view.

    It’s not the be-all and end-all, but it’s a great start for most accounts.

  • https://www.akademiaanalytics.pl Maciej Lewiński

    Sayf, could you please share with us the list of bots and spiders which Google Analytics excludes by default?

  • Sayf Sharif

    Maciej,

    Google hasn’t revealed a full list of the bots and spiders they exclude, with or without this feature. The list that gets excluded via the checkbox, as I mentioned above, is only available for $4,000 to $14,000 a year from the IAB/ABC:

    http://www.iab.net/1418/spiders

  • http://www.metricks.org Jon

    I’ve created a separate view specifically for testing this against my raw view; I was hoping it would deal with the dreaded Semalt referrals, but unfortunately not – they are coming through in each view. Hopefully Semalt gets added to the IAB list for their nefarious activities; I’m pretty sick of filtering them out! (This is a good read if anyone is unsure what Semalt has been doing: blog.nabble.nl/post/93306955157/semalt-infecting-computers-to-spam-the-web)

  • Sayf Sharif

    Good read. Yeah, it’s becoming apparent that there are still plenty of bots and spiders that this checkbox doesn’t catch.

  • http://www.imspgh.com/ Jeremy Cid

    Great article, Sayf! I’m curious what anyone else is seeing in the way of traffic differences. I’ve read that bots average around 30% of traffic to sites, but is anyone seeing anything different?

  • Sayf Sharif

    We’ve seen it vary wildly. Some sites get absolutely pummeled with bot and spider traffic (we’ve seen upwards of 40% on one site if you can believe that) and others get low single digits. I’d love to hear what other people are seeing as well.

  • http://www.imspgh.com/ Jeremy Cid

    I can believe it. I’ve got some clients sitting around 37%, while others are sitting around 15%. I’m curious to see if there’s a significant difference from one industry to another.

  • http://www.conetix.com.au Jamin Andrews

    Great read, I have wondered about the impact since the Google announcement, so thank you for your take and more detail.

  • http://nicolascliche.com/ Nic Cliche

    This is great advice for safely implementing this new GA feature.

    We had a particular problem with Bingbot in October and it caused serious damage to our data integrity…

  • http://www.caominhblog.com Nguyen Cao Minh

    Great post. I knew about this checkbox but didn’t actually understand it. Now I know why, on certain days, I saw an unusual increase in traffic. It was caused by tools.pingdom.com, a website speed test.
    Thank you

  • http://seaside-soft.com Conrad

    Excellent post. Ok.. I might be wrong here, but I “think”:

    After Google spiders my site, I can drill down and see the activity by selecting “Service Provider” in any of the reports (for example, I could have selected country, operating system, browser, etc., but instead chose Service Provider).

    To exclude the activity from ALL data (including historic data), add a filter to any report: Exclude where Service Provider = Google.

    You might be able to do this for other bots.

    Hope this helps!

  • Conrad

    Adding to my comment above..

    I should be more specific. Here is the exact report drilldown:

    Audience -> Technology -> Network

    One of the entries is “google inc.”. On my report it shows 31 sessions today (I just submitted my site for indexing).

    Hope this helps.

    Conrad

  • Sayf Sharif

    Conrad, absolutely. In fact, we have a number of older blog posts focusing on doing just this, isolating out a variety of bots coming from Microsoft, Amazon, Inktomi, and more. Jim’s post here: http://www.lunametrics.com/blog/2013/09/05/filter-bots-google-analytics/ has a nice regular expression to use for this very technique.

    Unfortunately, there are a number of bots that successfully hide from this: they come from a variety of ISPs, pretend to be a variety of browsers, and can even log into your system under certain conditions, and they’re very hard to distinguish in this manner.

    However, we’ve seen those sorts of filters remove 30% of a site’s traffic before, and it was very obviously bot traffic (one page per session, 100% bounce rate, a generic “Mozilla compatible” browser agent, etc.).
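
    To give a rough idea, that kind of filter is an exclude filter on the ISP Organization (Service Provider) dimension with a regular expression along these lines (illustrative only; the actual expression in Jim’s post may differ, and you should verify each organization in your own reports before excluding it):

        microsoft|amazon|inktomi|google inc\.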

  • http://blog.flowl.info Dan

    I can’t find the checkbox that excludes the bots (using Analytics from Germany).
    I think Google may activate this automatically… my settings page looks different somehow.

    Also, I think bots using headless browsers like headless Chrome or PhantomJS (which can have modified properties like screen height and width, user agent value, etc.) cannot be excluded.
    Regards

  • Sayf Sharif

    Not sure why you’re not seeing it in Germany. I don’t recall it being region specific. Are you sure you’re looking in the right area of the admin?

    Regarding headless browsers, it can occasionally be hard to detect them, particularly when they spoof user agents or come from a variety of locations. The more of this traffic we can get rid of automatically, the better.

  • http://www.wg2k.com tania

    Any advice for blocking semalt.semalt.com? This bot filter has made no difference; they’re still skewing my data.

  • Sayf Sharif

    Tania,

    Primarily, if you’re sure that all the traffic from a single service provider is a bot, you should filter them out. Create an exclusion filter on ISP Organization, and name the bad service provider to remove them from your reports.
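
    For example, assuming the bad traffic really does report its organization as something containing “semalt”, the exclude filter on the ISP Organization field would use a pattern along these lines (illustrative; confirm in your own reports first):

        semalt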

  • http://www.stevenfenech.com/ Steven Fenech

    Didn’t know about that check box. Thanks 🙂

  • Liam McArthur

    My brand new website now has a bounce rate of 100% due to having no organic visits but a load of visits from Semalt, which was a pain. However, I just deleted the profile, re-created it, and blocked semalt.com from crawling my website. I’ve never had problems with any other bots crawling my websites and causing data inaccuracies.

  • Davide

    Great post Sayf!!
    I manage a Google AdWords campaign for a client and, of course, his Google Analytics account. I have the same problem with this type of bot, with sessions from Russia and Samara; the strange thing is that there isn’t any Analytics code on that site! I created a tracking code, but it hasn’t been pasted into the site yet. I don’t know how that’s possible.
    Would it be a good idea to create another Analytics tracking code to insert into that site?
    Thanks! 🙂

  • Denise

    I’m having problems with semalt & buttons-for-website in my analytics reports. I have tried exclusion filters, but it hasn’t helped; they still find their way in. I am using a Google blog & was told that you could not change the .htaccess on a Google blog (whatever that means), so I am unable to use the code to block them. I decided to check the box to exclude all known hits from bots & spiders, but will that keep Google from crawling my blog? That’s how my clients find me.

  • http://aimforsimplicity.com Krzysztof Jendrzyca

    This checkbox should be checked by default.

    • Sayf Sharif

      To a certain extent I agree, but I also like that by default a property and view are bare and unfiltered, leaving it to you to add what you need. Perhaps Google Analytics could be better at making basic recommendations, while still giving you an “Advanced” option that starts sans everything.

  • http://traffic-bots.com/ unuro

    Bots are software applications that imitate the work of a human operator. – http://traffic-bots.com

  • http://www.twitter.com/daveculbertson Dave Culbertson

    Did you get this issue resolved? I’ve successfully implemented filters that block both.

    • El-The Beauty Isle

      Can you tell us how? I have this problem too and the filters in GA don’t seem to be working on some of them. Also, is it normal to need to filter a new ghost referrer nearly every day??

      • http://www.twitter.com/daveculbertson Dave Culbertson

        While they can’t all be caught immediately, many can be eliminated by setting up a hostname filter that limits recorded visits to your domain. In addition to turning on GA’s bot filtering, you can also add a filter to capture most of the spam visits that aren’t caught by the bot filter and the hostname filter. The best write-up on the Web is here:

        http://www.analyticsedge.com/2014/12/removing-referral-spam-google-analytics/
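
        As a rough illustration, that “limit to my domain” hostname filter is an include filter on the Hostname field with a pattern along these lines, where example.com is a stand-in for your own domain (add any subdomains you legitimately use):

            ^(www\.)?example\.com$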

  • Troy

    You need to add those domains to Property -> Tracking Info -> Referral Exclusion List. Just remember that this will not remove historical records.

  • http://crokes.com/ Shaun

    Please suggest some free traffic bots.
    crokes.com

  • http://google.com/+jenniferbailey Jennifer Bailey

    You don’t even say where to find this in Analytics. Kind of a useless article then.

    • Sayf Sharif

      Sorry for that confusion. It’s shown in the screenshot of the View Settings within the admin panel. It must be enabled on each view individually.

  • Alonso Rodriguez

    Getting hit with bots creates a 100% bounce rate, which in turn lowers your Google page rank, organic traffic, and eventually your organic position. My question is, why would you want to hide the bots with a “Google checkbox” when you know the problem is still there? That’s like putting a band-aid over a large wound. I would understand the use of this “Google checkbox” for hiding bots if you are running PPC on your website, but what if you are not running PPC and your business depends solely on organic traffic? Don’t get me wrong, I’m all for the quick fix, but nobody seems to be talking about the negative effects bad bots have on organic positioning. Or so I think; I don’t really know. I’m in the same boat as everyone in here. Can someone help me out? http://employeelawca.com/ Thanks, sorry about the link, I drop links subconsciously.

    • Sayf Sharif

      From our angle, we’re not looking to eliminate bots hitting your site, simply to give people clean data they can make business decisions with. There are other people out there looking to eliminate the malicious bots, and I support them 100%, but it’s not what LunaMetrics does.

      Our goal, until we can get someone else to eliminate the malicious bots, is to give people data they can use to make a decision on. You should always maintain an unfiltered view with all your data, and even make a ‘bots’ view to just see what that traffic is doing, but the key thing for us is to eliminate it from the main views people use to make decisions on digital marketing. Bot traffic can be significant enough to affect conversion rates, and make you think things are working better or worse than they actually are. Our goal is to try and make that data as accurate as possible until someone else figures out how to block the other malicious bots from ever crawling our websites.

      • Henry Ollie

        Hi Sayf, is Alonso right about bots still impacting your rankings and organic traffic even with this filter?
        I tried the filter yesterday and the bots are not showing up. I was getting hit hard twice a day by these fiends, and they leave behind a 100% bounce rate and a lower avg. time on my website. I have tried the guidance for modifying .htaccess and robots.txt. The bots went from being spam referrers to hitting my website directly. Why can’t the hosting provider (I use LunarPages) put up a firewall using the IAB list?

        • Sayf Sharif

          Not really in that regard. Bounce rate IS a small factor in organic traffic, but the key part there is that the traffic needs to actually have visited through a Google search in the first place, clicked on the link on a Google search, and then bounced back relatively quickly to the same search. So in theory a bot that loaded a google search page, and then clicked a link on a google search, hit your site, and quickly reloaded the google search page, would affect your bounce rate, but Google only looks at those numbers in large aggregate, and it’s not a major factor in your organic rank.

          Just a bot hitting your site from somewhere and tanking your Google Analytics bounce rate will not affect your Google Search ranking. How could it? Direct hits onto your website may be recorded in Google Analytics, but Google Search isn’t reading your Google Analytics data to come up with its rankings.

          If you did have some sort of weird malicious bot trying to lower your search engine ranking, it would need to spoof a browser, hit the Google search engine with enough validity that Google would think it’s a person, do a search to bring up your website, hit your site, bounce back, and then maybe hit another site and linger there… then clear all cookies, retreat, and tunnel back to Google using an entirely new IP address so the search engine wouldn’t think it was the same exact person returning. I guess in theory it could be done, but if you really think that someone is doing that to your traffic, then my money would be on you being a little bit of a nutter, rather than it actually occurring.

          The bots are mostly bad because they screw up your data, and make metrics like bounce rate in your GA difficult to use, but 99.999% of them won’t muck with your organic search ranking just by doing direct 1 hit visits on random pages on your site.

  • Shizuppy

    Checking the box still doesn’t filter out google’s own bot. I see a ton of hits from “google inc.” and have had to manually filter them out.

  • Daria Leanne

    Do bots always visit on a schedule of their own, or can they show up only when you post? Like, if I have a little spike in the 5-20 minutes after I post something to my blog, could that be bots or is that people who are getting notifications? (I just started blogging and don’t get much traffic yet, so a handful of visits in less than an hour is a “spike” for me.)

    • Sayf Sharif

      They can do both. There are bots that will detect when you post new content, via the RSS feed or something, and then they’ll come and scrape it and you’ll detect them. It can also be people who are following your stuff, but be suspicious if it’s consistently right after you post, no matter the time of day. If you’re sharing your posts on social media it’s probably people, but if you’re not sharing them at all, try posting in the middle of the night or something… if you still get that spike, it’s likely bots.

  • Angela Martinez

    Hi Sayf, thanks for this article. Does that filter block Googlebot? I doubt it would, but I just discovered Google WebMasters and apparently it doesn’t work if Googlebot is blocked.

  • DasBoot66

    I was adding posts from my blog to Google+ after reading that it was the right thing to do. I immediately saw bots in GA. Whether this is good or bad, I have no idea. LOL.

  • saystoptospam

    Hi,

    Great post!

    I would like to mention a tool we just released: http://www.saystoptospam.org/

    It’s an easy way to share the spam referrers you’ve blocked in your reports.

    We use these to publish, every week, an updated Google Analytics segment to apply to your reports.

    It’s totally free.

    Could you spread the word on one of your social accounts or on your blog?

    Thanks

  • Saiful Islam

    It’s a nice post. Recently I noticed some unwanted URLs affecting my Google Analytics results by increasing the bounce rate. I was very worried. Now I feel relaxed; I’ll try this out. Thanks for sharing.

  • http://WWW.TEDIPOST.COM tedipost

    To learn more about this referrer spam, here are some articles:

    Blogging Tips For Removing Referrer Spam Using Google Analytics

    To learn about bot programs:

    Bot is Bad or Good Program – Blogging Tips About Bot

    To learn what referrer spam is, with an FAQ about referrer spam:

    Referrer Spam is not a real traffic – Blogging Tips About Referrer Spam

  • John

    This definitely is not even close to a solution. It does not work. Not even close.
