Data Processing Options for Google Analytics and BigQuery Export/
July 28, 2014
In this blog post, I evaluate several of the numerous (and potentially overwhelming) options for the processing and reporting of Google Analytics data. The default Google Analytics web interface is great for quick ad hoc data exploration, but limited for deeper analysis and the development of automated reports.
Whether we’re mining for hidden trends or trying to report on hard-to-extract dimensions, there are a number of third-party tools out there can that help ease the burden.
In the first half of this article, I explain the difference between the two types of Google Analytics data: what’s available from the standard interface and what’s available through the BigQuery export.
The second half of this article is an evaluation of three different solutions for processing, visualizing, and reporting on Google Analytics/GA BigQuery data. I evaluate these three solutions (ShufflePoint, Tableau, and R) based on objective features and my subjective scoring of performance.
I only evaluate three data processing solutions in this article. Think I missed a good one? Let me know!! We all have a different background in data analysis tools, and I would love this conversation to continue in the comments section.
1. Google Analytics Data vs. BigQuery Export for Google Analytics
There are two types of Google Analytics data. First, the standard type that is available through the interface and the API, and which we’ll refer to as “summary data”. The second, available only to Premium users through the BigQuery Export for Google Analytics feature, is hit-level/session-level data . We refer to this as “granular data”.
“You can export your session and hit data from a Google Analytics Premium account into Google BigQuery, so you can run queries using a SQL-like language.”
You can access all of your Analytics Data. Great, but what does that actually mean? Well, if you’re a frequent reader of the LunaMetrics blog, you probably already know the answer!
From colleague Jonathan Weber’s post on BigQuery:
“Because this is session-level data, we can get at the kind of effects you can usually only get at with Advanced Segments in GA (including the new user segments). For example, finding all the visits by users who have viewed Product A. And using the power of BigQuery’s processing engine — voila, no sampling, no matter how big the data set you start with.”
If you’re not a frequent user of Google Analytics or don’t have much experience with the Google Analytics API, you still might be wondering why exactly the session-level and hit-level data export is a big deal.
Comparing the Two
A good way to understand standard Google Analytics data is to think of an Excel Pivot table. A pivot table is a data processing tool that provides a summary of the individual data points, sorted by a set of dimensions.
Say we have 100 sessions on our website, and we want to look at pageviews based on the browser and state (region) of those sessions. The limitation of standard Google Analytics data (through the interface and standard Data Export) is that we can only view the completed pivot table. i.e. we can view the summary (average or sum) numbers for metrics like pageviews per session, session duration, etc., but we cannot view the actual data that was used to create the table.
The standard reports from Google Analytics look like completed Pivot Tables
Through the BigQuery data export, we’re actually able to pull back the curtain and take a look at each of the rows of data that were summarized by the pivot table above. We get each line of data, and we can then crunch it, pivot it, and process it any way that we desire.
When is Standard Google Analytics Data Sufficient?
Standard Google Analytics data is sufficient for most basic analyses:
- Did bounce rate increase for Internet Explorer users since the homepage upgrade?
- Are users from Oklahoma converting at a lower rate than users from Texas? Should ad spend be shifted accordingly?
- Are women or men consuming more Pageviews (and ad content)? Which age group is increasing fastest among users who made purchases?
Standard Google Analytics data is not sufficient under two conditions:
a. Sampling is triggered by the number of sessions or bucketing is caused by too many rows
- Sampling occurs when the number of sessions in queried timeframe exceeds 250K/500K at the property, not view, level
- Sampling occurs when the number of sessions in the flow visualization report exceeds 100K
- Bucketing occurs in GA interface when there are more than 75,000 rows in a standard report (for a given day)
- Bucketing occurs for data export when there are more than 10,000 rows over requested timeframe
You can read Google’s documentation for more information on Sampling and Bucketed data, which appears in the reports as (other).
b. More-complex analysis is required
- Is there a significant difference in revenue per session between the sessions (visits) from campaign A and campaign B? (Requires standard deviation of revenue per session in each campaign.)
- What is the likelihood that a user (visitor) purchases products A and B together (market basket analysis)? Is this association statistically significant?
- What is the likelihood that a visitor will make another purchase after purchasing product A? Which product has the highest value for this likelihood? (Determine best product on which to offer a promotional discount to new customers).
Now we can see in more detail why Google Analytics BigQuery Export data has capabilities beyond the standard summary data available in the Google Analytics interface and through the standard data export functionality.
In my last post on Google Analytics Data Mining with BigQuery and R I provide further explanation of the types of analyses possible with hit-level and session-level data. (Bonus: if you’re using these tools already, there’s an R script there that will create a pretty cool Ecommerce report from your Google Analytics BigQuery Export data!)
2. Processing Solutions for Google Analytics Data
Now for the second half of this article: which data sources and data processing interfaces are right for my Google Analytics needs?
We have already covered Google Analytics and Google Anaytics Export for BigQuery as the data sources.
A few of the data processing interfaces which we are familiar with here at LunaMetrics are:
Shufflepoint is a data processing and report automation tool that works especially well for bringing data directly into Excel. It also has capabilities for PowerPoint and custom web dashboards. It currently aggregates data from twelve data sources including Google Analytics, Google AdWords, Google BigQuery, YouTube.com, and Salesforce.com. This list is always growing.
We especially love ShufflePoint because the customer service is incredible. These analytics platforms that it supports are always changing, and ShufflePoint is extremely quick and proactive in supporting these changes. They also add support for platforms you are using if they are not currently supported.
Tableau provides not only data processing and aggregation, but also extensive data visualizations. Further, the learning curve for Tableau is less steep than other solutions. Along with the power of the visualizations, this ease of use is Tableau’s biggest strength. It feels like working with Excel Pivot Charts on steroids.
The professional version of Tableau connects to some analytics platforms, including Google Analytics, Google BigQuery, and Salesforce.com. The full list is available here: http://www.tableausoftware.com/products/techspecs. Tableau is more focused on ERP systems and Big Data sources. It does not list many of the digital analytics platforms supported by ShufflePoint.
R (using RStudio)
R is a free software programming language designed for statistical computing and graphics. Thanks to the work of independent developers, there are R packages to access Google Analytics and Google Analytics BigQuery Export. This has the steepest learning curves of the three data processing and reporting solutions, but is also the most powerful.
Although R is free, you definitely want to use RStudio for any reporting. RStudio is a freemium model, with pricing plans for enterprise solutions. They also develop a web reporting platform called Shiny which allows you to generate HTML and CSS web reports using only the R programming language.
Assessment of Data Processing Solutions
The table below is my evaluation of these three solutions (ShufflePoint, Tableau, and R) based on objective features and my subjective scoring of performance.
If you have Google Analytics Premium already or are considering getting it and using BigQuery, you should read through Jonathan’s post on BigQuery (as I mentioned above). He provides more detailed explanations on BigQuery, namely:
- Who is eligible for Google Analytics Export for BigQuery
- How you specifically access the data
- How you set up BigQuery
- Cost of running BigQuery
And for even more specifics on the use cases of BigQuery, refer to one of the first posts on BigQuery, by my colleague Dorcas Alexander.