
Using ALL of Google Webmaster Tools data to tackle (not provided)

Note: The code provided in this article has been updated, and is now provided as a bookmarklet at the link below. This article has not been redirected, because it includes analysis not provided in the new article. Click here for the bookmarklet


My reaction to the GWT New Year’s Update

I couldn’t believe it when I saw the January 7, 2014 Webmaster Tools update:

“data in the search queries feature will no longer be rounded / bucketed.”

At first I thought: why would Google go to all that trouble to obfuscate keyword data in Google Analytics if they planned on handing all that data back through the search query reports in Webmaster Tools? And of course, they didn’t plan on anything of the sort. The relatively minor update only removes bucketing, and does not address the big issue: that they display only 20% to 25% of search query data. I held out hope that, as it appears in the before and after pictures, the sampling rate had been increased from around 20% to around 35%. But while I’ve noticed small changes in some accounts, it does not appear they’ve made this improvement.


Webmaster Tools before January 7th, 2014 Update

 


Webmaster Tools after January 7th, 2014 Update

So, how much of a boon IS the retraction of bucketing in GWT’s search queries? There definitely isn’t anyone complaining. It’s great to no longer see “<10 clicks” for our long tail queries. Of course, the biggest cost of (not provided) to the digital marketing community is the new-found powerlessness to relate search intent with landing page and overall site performance. While much energy and creativity are channeled towards addressing this issue with third party tools, I believe there is yet untapped insight inside Google Webmaster Tools.
 

Patch Analytics with Webmaster Tools

Before we get into the scope of this article, a shout out is due to Ben Goodsell, who came up with a nice way to beat the bucketing over a year ago. Now that we no longer have to worry about bucketing, we can use an easier variation of his method to combat (not provided). After downloading the organic keyword data from Google Analytics and the search query data from Google Webmaster Tools, I used the latter (now accurate) data to correct the former. I won’t go into the details of my Excel setup, but I included a screenshot below; I can post the setup if there is interest. In this case, we went from 2,283 visits with defined keywords in GA to 6,802 using the GWT data. Of course, when you start with only 4% of your organic visits as not (not provided), a 198% increase is not as impressive. Still, it is better than nothing.
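As a rough sketch of the patching step (not the author’s actual Excel setup — the record shapes, keywords, and numbers below are hypothetical stand-ins for the two exported files):

```javascript
// Patch GA organic keyword counts with GWT search query data.
// In practice both arrays would be parsed from the GA and GWT exports.

// GA organic keywords: mostly hidden behind (not provided)
var gaKeywords = [
  { keyword: '(not provided)', visits: 5000 },
  { keyword: 'lunametrics seminars', visits: 120 }
];

// GWT search queries (no longer bucketed, so clicks are exact)
var gwtQueries = [
  { query: 'lunametrics seminars', clicks: 300 },
  { query: 'google webmaster tools export', clicks: 250 }
];

function patchWithGwt(ga, gwt) {
  // index GA rows by keyword
  var byKeyword = {};
  ga.forEach(function (row) { byKeyword[row.keyword] = row; });

  gwt.forEach(function (q) {
    if (byKeyword[q.query]) {
      // trust the (now accurate) GWT click count over the GA count
      byKeyword[q.query].visits = Math.max(byKeyword[q.query].visits, q.clicks);
    } else {
      // query visible in GWT but folded into (not provided) in GA
      byKeyword[q.query] = { keyword: q.query, visits: q.clicks };
    }
  });

  return Object.keys(byKeyword).map(function (k) { return byKeyword[k]; });
}

var patched = patchWithGwt(gaKeywords, gwtQueries);
```

The same join is what the Excel setup does with VLOOKUPs: GWT rows either correct an existing GA keyword or add a keyword that GA reported only as (not provided).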
 
Combining GWT Search Query Data with GA Keywords
 

Re-connecting Queries with Landing Pages

Short of using Google Tag Manager to import optimized keywords into your Google Analytics (which everyone should also do, by the way), Webmaster Tools still provides the last in-house way of connecting search queries with your site content. Below is the Search Query->Top Pages report from GWT.

The Top Pages Report in GWT

Notice the number of clicks, circled in green. When I first saw this, I did another impression of the Toronto Raptor, thinking I had discovered a loophole in GWT’s sampling methods. But of course, the ‘displaying 136,552 of 153,511 clicks’ means that nearly 90% of clicks are accounted for in terms of landing page. When you drill down into the keywords for each page, observe that only the standard 20% to 30% of search queries are accounted for. Still pretty neat, though, huh? You can now get an (exact) number of clicks for a given page for any search queries that made it past Google’s sampling method. What could we do with that data? Well, it would be great to export it, play around with it, and see what types of additional insights we can draw. Which brings us to the next point of our review of GWT.
 

Poor API Support!

The only part of Google Webmaster Tools as frustrating as the (former) bucketing and (ongoing) sampling is the lack of official API support. There is an official Java API, but it cannot return search query data; only crawled keywords, crawl issues, etc. And the unofficial APIs that I have seen (PHP and Python) do not support easy OAuth integration and have only limited support for search queries. Even the Google Analytics integration is lacking: the search query data cannot be combined with any meaningful GA metric, and, to make things worse, the imported data is still being bucketed! So, to access the Search Queries->Top Pages report without any heavy coding, we need to use the GWT user interface.
 

Search Queries->Top Pages Export

Unlike the standard Top Queries report, we cannot export the complete Top Pages report via the UI. The best we can do is export the summary table with a breakdown only by pages (and not search queries). We could also technically scroll down the page, expanding each of the links by hand, but that would be painful. I wrote a couple of JavaScript functions to automate the process. The code is rough, but it does download ‘page’, ‘search query’, ‘impressions’, and ‘clicks’ columns for each search query entry, in TSV format for Excel. The code is available from GitHub, and is also included below. I have only used it in Chrome.
 
Steps to export your Search Query->Top Page report from Google Webmaster Tools:

  1. Log into GWT and navigate to Search Traffic->Search Queries->Top Pages.
  2. Set the grid.s parameter (default 25) in the URL to however many pages you want to download. You should also order the pages by clicks if you are downloading fewer than the maximum number of rows. For example:
     https://www.google.com/webmasters/tools/top-search-queries?hl=en&siteUrl=http://www.yoursite.com/&de=20140121&db=20131101&more=true&qv=amount&type=urls&prop=WEB&region&grid.s=719
  3. Set your desired date range. Up to three months prior is available in GWT. As a side note, it might be a good idea to back up your data every three months.
  4. Press F12 to open the JavaScript Developer Tools and select ‘Console’.
  5. Copy and paste the first JavaScript snippet below into the console and hit Enter. You will be presented with an alert for each page entry in the table that Google is unable to expand. Simply hit Enter to cycle through the alerts. When all the alerts are done, and all the page entries that Google can access have been expanded, proceed to the next step.

    //expand page entries
    (function(){
       var pages = document.getElementsByClassName('goog-inline-block url-detail');
       for(var i=0;i<pages.length;i++){
          pages[i].setAttribute('href','#');
          pages[i].setAttribute('target','');
          pages[i].click();
       }
    })();

  6. Copy and paste the second JavaScript snippet below into the console and hit Enter. As long as your pop-ups are not blocked, you will be prompted to download a TSV with your GWT search query->page data.
//generate download link

(function(){

  //build an index of the page rows:
  //get the page rows separately from the
  //query rows, order them, and store a
  //2-item array for each page row: its
  //index in the table and its page path
  
  temp = document.getElementById('grid').children[1].children;
  indices = new Array();
  tableEntries = Array.prototype.slice.call(temp);
  pageTds = document.getElementsByClassName('url-expand-open');
  
  for(i=0;i<pageTds.length;i++){
    temp = tableEntries.indexOf(pageTds[i]);
    indices.push([temp,pageTds[i].children[0].children[0].text]);
  }
  
  pageTds = document.getElementsByClassName('url-expand-closed');
  
  for(i=0;i< pageTds.length;i++){
    temp = tableEntries.indexOf(pageTds[i]);
    indices.push([temp,pageTds[i].children[0].children[0].text]);
  }
  
  indices.sort(function(a,b){return a[0]-b[0]});
  
  // this is complicated: we need to adjust the index of the
  // table rows, since the aggregate page listing
  // is a row just like the expanded query rows
  for(i=indices.length-1;i> 0;i--){
    test = indices[i][0]-indices[i-1][0];
    if(test===1){
     indices[i-1][1]=indices[i][1];
     indices[i][0]++;
    }
  }
  
  thisCSV = "page\tkeyword\timpressions\tclicks\n";
  queries = document.getElementsByClassName("url-detail-row");
  
  //use count to know when to update the page
  //column for the TSV (sorry if convoluted;
  //this was done quickly, not elegantly)
  count = 0;
  
  for(i=0;i<queries.length;i++){
    if(indices[count][0]===i){
      thisPage = indices[count][1];
    
      do {
      count++;
      test = indices[count][0]-indices[count-1][0];
      } while(test === 1);
  
    indices[count][0]-=(count); 
    //because the page rows and keyword rows are all in
    //the same kind of tags, and were counted at the
    //same level in the index before
    }
  
    thisCSV += thisPage+"\t";
    l = queries[i].children[0].children.length;

    if(l > 0) thisCSV += queries[i].children[0].children[0].text+"\t";
    else thisCSV += queries[i].children[0].innerHTML+"\t";
    
    thisCSV += queries[i].children[1].children[0].innerHTML+"\t";
    thisCSV += queries[i].children[3].children[0].innerHTML+"\n";
  
  }
  
  
  //create href and click it as means to save tsv
  encodedUri = "data:text/csv;charset=utf-8,"+encodeURI(thisCSV);
  link = document.createElement("a");
  link.setAttribute("href", encodedUri);
  
  //update name w timestamp if you want
  link.setAttribute("download", "GWT_data.tsv");
  link.click();
})();

 

Delving into the Data

Now that we’ve downloaded the data, let’s talk about what we can do with it. Why did we even download it in the first place? Well, as mentioned in the export steps above, GWT data is only available for the past three months. If you regularly back up your data, you will have access to more than three months and may be able to conduct better keyword analysis. In addition to maintaining historical data, we may be able to glean insight by sorting it and comparing it to other data sets. I’ll outline how I used Excel for such a project. My approach was to increase the proportion of total data accounted for by the data displayed in Google Webmaster Tools, based on the following assumption.

Assumption:
The process by which Google filters (chooses which queries are displayed in GWT) is not dependent on the keywords themselves. In other words, while Google might, for example, tend to display fewer long-tail keywords to us, they are not always blocking the same keywords on a weekly or monthly basis.

If the above assumption holds true, we can partition the data into weekly or monthly segments, and then estimate clicks for queries that appear in some time segments but not in others. This technique would likely be safer when working with monthly data, as there is a better chance the above assumption is met. For the sake of demonstration, I download the last three months’ Search Query->Top Pages data and partition it into six two-week segments. After importing it into Excel, I create a master list, previewed below.


Exported TSV of GWT page-level search queries

The fourth column is an index that represents the two-week time period. Next, I create a pivot chart with the data, displaying the search queries as rows and the two-week time periods as columns. The values listed as visits are actually clicks. This method is most applicable to search queries with a medium level of clicks: these queries are common enough that they can be expected to be searched every two weeks or every month, but not so common that they need to be regularly included in the GWT reports (or else be conspicuously absent).


Left: Pivot Charts of Page-Level Search Queries. Right: With Missing Clicks Estimated


 

Results

Using this method, I’ve accounted for 13% more clicks (visits) without introducing new keywords. Further, I’ve only used:

  1. three months of search query data,
  2. a small website, and
  3. quickly changing web pages (the vast majority of landing pages are blog articles).

This method will be even more useful for:

  1. those with more than three months of historic data,
  2. larger websites, and
  3. websites with more static web pages.

 

Extensions

  1. Scale the monthly estimated corrections using aggregate search volume data. This will help to determine whether the absence of a search query is due to filtering by Google or just a lack of interest by searchers.
  2. Use Dimension Widening to import the refined search query data into Google Analytics, associating it with the landing page dimension.

 

Assumptions Revisited

I had reasoned that between the two-week periods, there are keywords that are sometimes displayed to us in Google Webmaster Tools and sometimes blocked. For any search query where some two-week periods have zero clicks, I average how many clicks per two-week period it received over the three-month period, and assign that value to the given query. While there are certainly legitimate cases where a search query had no clicks for a given week, I reason that the error of wrongly assigning a search query click to a given page is less than the gain netted in terms of better understanding our search queries (and on a page-by-page basis, at that!)
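The fill-in just described can be sketched in a few lines of JavaScript (the query name and click counts below are made up; each array holds a query’s clicks for the six two-week periods):

```javascript
// Estimate clicks for periods where a query was (we assume) filtered out.
var clicksByPeriod = {
  'webmaster tools export': [12, 0, 9, 0, 15, 11]
};

function estimateMissing(periods) {
  // average clicks per two-week period over the whole three-month range...
  var total = periods.reduce(function (a, b) { return a + b; }, 0);
  var avg = total / periods.length;
  // ...and assign that average wherever the query shows zero clicks
  return periods.map(function (c) { return c > 0 ? c : Math.round(avg); });
}

var filled = estimateMissing(clicksByPeriod['webmaster tools export']);
// filled: [12, 8, 9, 8, 15, 11]
```

In practice this runs only over queries that appear in at least one period, which is exactly the mid-frequency band the method targets.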

And what if Google is consistently hiding the same keywords over the three-month period? I would argue that this would be very hard for Google to achieve while still displaying a relatively consistent percentage of total queries. (What happens if traffic drops so much on a small website that Google would be forced to display more than 20% or 30% of keywords?) They probably need to display keywords that were previously classified as hidden, even if they do not admit it.

Noah Haibach

About Noah Haibach

Noah Haibach is a data analyst at LunaMetrics. Working with Google Analytics provides Noah with more data than his wildest dreams (or even nightmares sometimes), enabling him to pursue his passion of data mining and predictive analytics. Noah wasn't always working in the tech field. His background includes math and biology research. He also has an affinity for languages, racquet sports, and brewing.

http://www.lunametrics.com/blog/2014/01/23/google-webmaster-tools-data-not-provided/

27 Responses to “Using ALL of Google Webmaster Tools data to tackle (not provided)”

Anna N. says:

Great post! Can you provide more details how to use the Dimension Widening and import the refined search query data into Google Analytics? Thanks.

Noah Haibach says:

Thank you Anna!

I probably should have expanded on what I meant by using dimension widening. We of course have no way of matching organic search visits with their actual search queries. What we might do, however, is probabilistically assign search queries to visits based on landing page. I admit the process is a bit tricky, and requires more initial setup in Excel. Here are the steps as I see them:

1. Download the GWT Top Pages report as I outlined in the article.
2. Create a new view for your property in GA.
3. Set a visit-level custom dimension to serve as a key for the dimension widening. Its value should be set to a random number between 1 and n, where n is larger than the largest number of organic visits for a landing page displayed (in GWT) for a two-week period (for us, it might be 500-1000). Use Math.random() to set this value.
4. Set up dimension widening in GA based on two keys. The first key would join on landing page, and the second would join on the random value you set with the previous step.
5. Use the GWT Top Pages report downloaded in step 1. Use the prior two weeks’ search queries, and distribute the random numbers (from 1 to n) so they represent the distribution of visits among the search queries. Do this for each page.
6. Generate a CSV from the updated GWT Top Pages report. This CSV will have to be updated every two weeks, or as quickly as search trends change for your content.
7. The Dimension Widening, along with a filter, could be used to rewrite all (not provided) entries as a custom dimension ‘keyword or estimated if not provided’.
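Step 3’s random join key might be generated like this. This is only a sketch, assuming Universal Analytics (analytics.js); the dimension index (‘dimension1’) and n = 500 are placeholder choices, not values from the article:

```javascript
// Generate a random session-level join key for dimension widening.
// n should exceed the largest number of organic visits any landing
// page sees in a two-week period (hypothetically, 500 here).
function randomJoinKey(n) {
  // random integer between 1 and n, inclusive
  return Math.floor(Math.random() * n) + 1;
}

var key = randomJoinKey(500);

// On the page, the key would be sent as a custom dimension, e.g.:
// ga('set', 'dimension1', String(key));
// ga('send', 'pageview');
```

The widening CSV then assigns each (landing page, key) pair a search query, with keys distributed in proportion to each query’s GWT clicks for that page.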

The initial setup is a bit complicated for this method. And there would be upkeep involved to update the Dimension Widening CSV every two weeks or month or so. While it does not reassign ACTUAL search queries to their respective visits, it could provide a more granular understanding of search intent than current landing page reports. It would be especially helpful for large websites that are less impacted by the error involved with our estimations. Let me know if I can provide more specifics or clarification.

Also, we might wish to simplify the data we downloaded from GWT by removing stop words and grouping similar search queries (before using it for GA dimension widening).

Noah

Chris says:

Very clever stuff.

If I understood correctly, the crux is:

If there’s a time-frame (eg two week period) when a keyword doesn’t appear, and it usually does, then add the keyword with an estimated value (instead of no keyword).

For this method, how many two week periods in a row can a keyword be missing and still get an estimated value assigned to it?

As the data set gets bigger you’d need to account for this or you may be adding keywords (eg seasonal keywords) when they don’t exist!

Noah Haibach says:

Thanks Chris, and great question!

To answer your second question first: yes, you definitely need to consider the seasonality of your site and keywords. As I mentioned as a possible extension, you could try to put certain keywords in context by checking them against Google Trends and other services.

Your first question needs a more nuanced answer. How many two-week periods in a row can a keyword be missing and still get an estimated value assigned? We can never be 100% sure; we must always take a probabilistic approach. I’ll give a rough example where we use a non-parametric estimation. Suppose we have four two-week periods and we have a keyword with the following appearances:

weeks 1-2 – 1 query
weeks 3-4 – 0 queries
weeks 5-6 – 2 queries
weeks 7-8 – 1 query

We can thus say that over the 2-month period, the rough chance the keyword appears on any given day is (1+0+2+1)/60 = 1/15. According to the binomial distribution, the probability that we observe no queries in a 2-week period is:

P(X=0 | n=14, p=1/15) = (14/15)^14 ≈ 0.38

So there’s a good chance the 0 queries was due to chance. Note that this is by no means an accurate model of the probability (the real deal would be more complicated), but it’s a good enough approach to give us an idea of the probability that 0 queries in a 2-week period is due to chance.
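The back-of-the-envelope number above can be reproduced in the console with a couple of lines (the per-period counts are the hypothetical ones from the example):

```javascript
// If a keyword appears with probability p on any given day, the binomial
// chance of observing zero queries across an n-day window is
// P(X = 0) = (1 - p)^n.
function probZeroQueries(n, p) {
  return Math.pow(1 - p, n);
}

var p = (1 + 0 + 2 + 1) / 60;        // ~1/15 chance per day
var pZero = probZeroQueries(14, p);  // ~0.38 for a two-week period
```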

Hey Noah, just released a new product you might like — it downloads all that data you can’t get at through the non-existent API using a simple Excel add-in (Windows only, sorry). First release…more to come…

Tommy says:

Hi Noah,

I have been trying and trying to get this code to download via Chrome, but so far no luck. The first part of the code, when it runs, returns undefined. Yet it looks good in Webmaster Tools. However, when I run the second part of the code to generate a download link, it is returned with an error of “TypeError: Cannot read property ‘0’ of undefined” – any advice? Thank you!!!!

Jimmy says:

Hi, I have the same problem as Tommy.

Adam says:

Hi,

Noah/Mike

I had the same problem as you when copying the script straight from the article into my browser. But when I copied it to a notepad first, it worked fine.

Hope this helps

Adam

Hi,

It worked for me when I refreshed the page.

Anyways, is there any way to include impressions in the tsv file?

Thanks so much for this Noah. A great contribution to SEO in today’s (not provided) world.

Regards,
Gerhard

cedric says:

Very great, thanks a lot for these functions.
Could you put a setTimeout in your first function?
Google sends a 403 if you try to unfold too many URLs, and I have more than 2000 URLs…

Anthony says:

Hi, I’m getting impressions but no clicks.

Is there a way to add this in?

Thanks

Noah Haibach says:

Hey everyone, thanks for the positive feedback.

@Anthony and Gerhard, I’ve updated the second script so that it now downloads impressions in addition to clicks.

@Tommy, Jimmy I’ve had trouble reproducing that error. But I have heard of it occurring to others as well. I’ll let you know if I figure it out. In the meantime, maybe you could try what Gerhard and Adam suggest. And you are running it in Chrome, correct?

@Cedric, try the code I’ve included below in lieu of the first JavaScript function. Set the timeout1 variable to your desired pause. It is currently set to 50ms:
 
 


//set timeout and expand page entries
(function(){
  var timeout1 = 50; //number of milliseconds between requests
  var k = 0;
  var pages = document.getElementsByClassName('goog-inline-block url-detail');
  var currInterval = setInterval(function(){
    expandPageListing(pages[k]);
    k++;
    if(k >= pages.length) clearInterval(currInterval);
  }, timeout1);
  function expandPageListing(page) {
    page.setAttribute('href','#');
    page.setAttribute('target','');
    page.click();
  }
})();

Guilherme says:

Noah, you are the man

Guilherme says:

Noah,

By the way, in the second script you have a line with:
thisCSV = “page\tkeyword\timpressions\tclicks\n”

I was running the whole script and getting an error, so I ended up removing the “\n”, and then it worked.

Not sure if it is a mistake or if it actually has a purpose.

Please let me know

Thanks

Noah Haibach says:

Hey Guilherme,

Thank you!

The line:
 
thisCSV = "page\tkeyword\timpressions\tclicks\n";
 
only serves to set the column headers for the CSV. It is not important and can be replaced with:
 
thisCSV = "";
 
You won’t get the headers if you use this code instead. I’m really not sure why the \n would give you an error. You can also try to replace the \n with \\n or \r\n. And make sure you have double quotes around the string. New line encoding can be wonky sometimes. But if your solution works too, go for it!

Hi Noah,

This is a great post with loads of useful insight. I really value attributing query data to landing pages. Google does make it hard for us though!

I have used your technique to extract the top pages data; however, I can only seem to download about 40 pages’ worth, even though I have well over 500 pages expanded in GWT. Any suggestions would be most welcome.

Thanks

Noah Haibach says:

Hey George,

I’m glad you found my post helpful! So, to confirm your problem, you are able to expand the 500 GWT pages. And when you run the second script, there are no errors, and you get a download. But the TSV file only has 40 pages?

Is that accurate? Approximately how many total pages do you have in GWT? And how many rows are there in your exported TSV?

Regards,

Noah

Dan says:

I have the same issue as George: the JavaScript times out and you get 403s on so many of the page expansions that you don’t get anywhere near the number you should. Using the bookmarklet, it just crashes altogether.

Noah Haibach says:

Hey Dan, maybe try changing the timeout. When you use this version (via the console), are you using the code snippet from the comments to expand the URL listings, or the original code in the article? And what browser are you using?

Thanks,
Noah

Yasri says:

Hi Noah,

I tried the time delay and it expanded the pages without any issues.

But when I run the second bit of the code, I am getting this error message on Chrome:

TypeError: Cannot read property ‘0’ of undefined

Can you help?

Santiago says:

Hi Noah, great post! Thanks.

I am getting the “TypeError: Cannot read property ’0′ of undefined” error too. Thank you so much for your time.

Noah Haibach says:

Hello Yasri and Santiago,

Try using the updated code (available as a bookmarklet) here: http://www.lunametrics.com/blog/2014/04/30/gwt-top-pages-export-bookmarklet/

Thanks,

Noah

Santiago says:

Hi Noah, it works perfectly. Genius. Thanks so much :)

Yasir says:

That’s awesome. It worked perfectly. Thanks for your time !!

Tracy says:

An alternate way to grab the Top Pages report with keyword queries is below (if you don’t have too many pages with 10+ clicks/keyword data available):

1. Click on the arrows to open up the queries.

2. Copy the page (Ctrl+A).

3. Paste the data into Notepad.

4. Copy from Notepad and paste into Excel.

It copies almost perfectly; the only thing I found wrong was that the impressions/clicks for each page were shifted down a row.

Gopi says:

Nice article, but I would like to know: how often does Google visit the site and update the data in Webmaster Tools?

Robert says:

First off Noah, THANK YOU! This script is a game changer! I’ve successfully used it for a couple of projects. AMAZING!

I am now having an issue capturing the data for a particular account. When I run the ‘expand’ script, it gets stuck. I am using the updated script and resetting the timeout to 150ms, but still no dice. Each of the landing page toggles on the left starts to give the spinning wheel of death and never resolves to expand the enclosed data.

I do get an error when the script first runs; in fact, there are 52 errors, all ‘Invalid JSON string’. They appear right when the script first runs, and then it seems to try to process the function. But the function never completes, and no data is expanded underneath each landing page.

Have you seen this before? Any suggestions?

Thanks again!
