Querying Google Analytics Data in BigQuery | LunaMetrics

Querying Google Analytics Data in BigQuery

/

blog-query-ga-data-bigquery-tinypng
BigQuery, a database designed to query massive datasets in parallel using an SQL-like language, is a member of the Google Cloud Platform. What makes BigQuery interesting for Google Analytics users, specifically Premium customers, is that Google can dump raw Google Analytics data into BigQuery daily. While this enables many types of analysis that can’t be performed within the Google Analytics interface, it also doesn’t provide any basic metrics, e.g. bounce rate, to use.

How to Query BigQuery

The BigQuery web interface is a convenient way to test queries. Selecting the google.com:analytics-bigquery:LondonCycleHelmet dataset from the left pane will show two tables, ga_sessions_20130910 and refunds_201309. The ga_sessions_ prefix is given to all Google Analytics data dumps, and the 20130910 suffix is the date of the data in the dump. The refunds table is application specific and not generated by Google. (We’ll deal with the refunds table in another post!)

BigQueryEditor

Clicking on either table will show its schema in the main pane. Clicking the “Query Table” button in the top right corner will present the code editor. Clicking fields will place them in the query editor. Sample queries are placed in this post to help you get started.

How is Data Stored in BigQuery

Each row in the Google Analytics BigQuery dump represents a single session and contains many fields, some of which can be repeated and nested, such as the hits, which contains a repeated set of fields within it representing the page views and events during the session, and custom dimensions, which is a single, repeated field . (This is one of the main differences between BigQuery and a normal database.)

The Export Schema contains many interesting fields and all of the raw data Google collects. This raw data can be used to recreate all of the metrics and reports you’ve come to love in Google Analytics (Although they may not be as pretty!), in addition to many other interesting metrics, such as the likelihood a bounced user’s next session won’t bounce.

Quick Introduction to SQL

BQSQL is a fairly readable language (and very similar to ANSI SQL found in other databases like PostgreSQL and MySQL) and you should be able to understand and reason about it. A quick and dirty way to remember the main (BigQuery SQL) BQSQL clauses is the following sentence.

SELECT some fields

FROM some datasource

WHERE the row is interesting

GROUP BY some fields I want to be unique

HAVING some aggregate I find interesting

ORDER BY some fields

LIMIT to the top so many

Throughout this and later blog posts, I will help familiarize the reader with BQSQL through actual usage. Similar to learning a foreign language through immersion, I’ll try to explain complexity as it arises and hope that the reader will “pick up” feel of the language as we go.

When doing analytics work, we often want to perform aggregate queries. These are queries that aggregate multiple rows of data into a single value. For example, think count the number of sessions generated by organic searches: we don’t want to know anything about each individual session, we just want the total count. Similarly for average revenue. A list of aggregate functions can be found with the BigQuery SQL documentation.

Example: Count the Number of Sessions Generated by Organic Searches

Note, for the examples in this articles will be using the example “London Cycle Helmet” dataset provided by Google. It is located in the google.com:analytics-bigquery project and the LondonCycleHelmet dataset.

By default, aggregate queries will aggregate over the entire dataset. A simple COUNT can be executed on any field and will count non-NULL fields. The WHERE clause allows you to define conditions to filter your data with.

Here, we only want to count sessions where the traffic source’s medium was “organic.”

Example: Bounce Rate

Sessions that have bounced are marked in the totals.bounces field with a 1 and sessions that didn’t bounce with a NULL. The bounce rate is the total number of sessions that bounced divided by the total number of sessions. While the AVERAGE function sounds like it would be what we’re looking for, it (like all aggregates) ignores nulls which would make the average 1! Instead, we need to simply COUNT the bounces and total sessions. (Since COUNT also ignores NULL it only counts the bounces.)

Here, we COUNT the bounces and total sessions, divide those values, and then name the result of the calculation bounceRate. The fields used come FROM the sample table.

Example: Bounce Rate by Day or Hour

We could GROUP BY date and compute the bounce rate for each day instead of over all tables. (Sad note: the London Cycle Helmet dataset is only a single day. We will instead compute the bounce rate per hour instead.)

Here, we use SEC_TO_TIMESTAMP to convert the visitStartTime field, which stores the time the session started as a UNIX Timestamp, to a type that the rest of the date-time functions in BigQuery, here HOUR, understand.

Also note that we moved the bounceRate calculation onto a new line and indented it a little. This is purely for aesthetic reasons and all whitespace (spaces and line breaks) are collapsed into a single whitespace: this means you can put statements on new lines, indent as needed, etc to make your code readable and understandable. More over, you can use -- to add comments to code as such. Any after a -- and on the same line is treated as a comment and ignored.

Commenting every line is often unnecessary, but well placed and informative comments will save a lot of time trying to figure out why a query was written the way it was originally. A rule of thumb I use is that if I needed to spend time thinking of how to write a part of a query, that immediately gets a comment about why it was written the way it was. Another rule of thumb is if I wrote a query, found it to be incorrect and had to fix it because of something I wasn’t expecting or thought of, then it also gets a comment. (Think about AVERAGE ignoring NULL. I know that now, but a new person reading my query later may not.)

Additionally, the capitalization used above (keywords in all caps and fields in CamelCase is only a convention to make reading easier. Capitalization doesn’t matter, but adhering to the convention is highly recommended and encouraged.

Example: Bounce Rates by Traffic Source Medium

Knowing that GROUP BY works on any field, or group of fields, in the table or defined in the SELECT clause is very powerful, especially when thinking about “some statistic” per “some value in a grouping”.

Now, let’s get crazy and group by multiple fields!

“But!” you’re going to tell me, “they’re all in a random order!” and you would be correct! We need to ORDER BY some field. Let’s do sessionHour first.

Example: Mediums with the Most Sessions per Hour

Now that you have the crazy juices flowing and feel like you can conquer any problem with just enough SQL, you begin to think “Hey, which mediums bring in the most people by hour!”

Going Further

You might begin to see the types of questions that could be answered with BigQuery that you can’t in Google Analytics, for instance more detailed drill downs into user-based metrics. We’ll save a discussion of those for our next post however!

In addition to future blog posts on using BigQuery and other tools to analyze your data, the book Practical Google Analytics and Google Tag Manager for Developers recently published by Jonathan Weber and the rest of the team here has an entire chapter on learning and using BigQuery with Google Analytics! (In addition to a plethora of good information and best practices when using Google Tag Manager.)

Jim is a Data Architect at LunaMetrics. He is a rail fan with a MS in Civil Engineering and BS in Mathematics, both from the University of Pittsburgh, and has worked with software at local companies and the Pittsburgh Supercomputing Center. When not train-watching or extracting information from data, he works on software projects and is a member at a maker space and public transit advocacy group.

Contact Us.

LunaMetrics

24 S. 18th Street, Suite 100,
Pittsburgh, PA 15203

Follow Us

1.877.220.LUNA

1.412.381.5500

getinfo@lunametrics.com

Questions?
We'll get back to you
in ONE business day.