What’s Missing in the Google Analytics BigQuery Export Schema?

/

whats-missing-in-ga-bigquery-export-schema
Google Analytics (via reports or the API) typically deals with aggregated data, where metrics are already summed and averaged for you, and you can easily request a tabular report of (say) Sessions by Date. On the other hand, the Google Analytics BigQuery Export Schema contains a wealth of raw data.

While the raw data opens up infinite possibilities, it also means that most Google Analytics Metrics and some Dimensions are not included in the export. These pieces of information, such as ga:hasSocialSourceReferral, ga:channelGrouping, ga:daysSinceLastSession, ga:pagePathLevel1, ga:landingPagePath, ga:contentGroupXX, and ga:previousPagePath need to be computed.

Perhaps most importantly, the goals that we’ve configured inside of Google Analytics are not stored in BigQuery and will need to be computed from scratch.

Brief Recap of the BigQuery Schema

BigQuery is a structured, table-based SQL database. In the BigQuery export, each row represents a session. Inside each session is the hit, custom dimensions, and other information about the session and hits. Below is an illustration of some of the fields within the export.

BQ Rows

Note that the session-level custom dimensions hits are repeated within the session and how the hit-level custom dimensions are repeated within each hit; this is one of the special properties of BigQuery: repeated fields.

Also note how the custom dimensions, hits, and totals have named fields within them; this is another one of BigQuery’s special properties: nested records.

Recreating Metrics

Since no metrics are contained within BigQuery, let us first examine methods to compute them. If you’re familiar with Custom Reports or the Google Analytics API, then you’re familiar with the concept of having metrics computed for a group of dimensions. In BigQuery, and SQL at large, that concept translates into aggregates and GROUP BY.

A simple example is the number of sessions per day (metric=ga:sessions, dimension=ga:date). We need to aggregate the sessions in the export by counting them, GROUPed BY date.

date sessions
120130910 63

Filters can be performed with a WHERE clause, when filtering data in the table, or HAVING, when filtering an aggregate value. To count only sessions with transactions, we can filter on totals.transactions.

date sessions
120130910 16

HAVING is similar to a WHERE but works on aggregate values, e.g. metrics. Using the above example, we could find all days with more than 70 sessions. (Since the LondonCycleHelmet dataset is only a single day, the results are immediately useful.)

date sessions
Zero Results returned

Now let us consider a slightly more advanced example: computing the metric ga:percentNewSessions with dimensions of ga:medium, we can aggregate the number of sessions by a count of all and a count of new sessions, GROUPed BY trafficSource.medium.

(We, unfortunately, cannot aggregate by averaging the new sessions flag in the export, because it is set to NULL, instead of 0, if the session is not new; and aggregates ignore null values, i.e. average would return 1.)

trafficSource_medium percentNewSessions
referral 0.5714285714285714
organic 0.4782608695652174
cpc 0.6
(none) 0.7222222222222222

We can also compute multiple metrics at once. Additionally, we are not limited to the 7 dimensions and 10 metrics that the Google Analytics API limits us to.

sourceMedium percentNewSessions bounceRate
technologysauce.com/referral 0.5714285714285714 0.14285714285714285
google/organic 0.4782608695652174 0.21739130434782608
google/cpc 0.6 0.2
(direct)/(none) 0.7222222222222222 0.1111111111111111

Interesting New Metrics

Since we can leverage all of BigQuery against our raw data, we can compute metrics that dont’t exist in Google Analytics or the API, e.g. quantiles or bucketing/binning.

Here we’ll compute the 25%ile, median, and 75%ile of ga:sessionDuration (totals.timeOnSite in BigQuery). We’re also showing of the ability, and common usage pattern, of using a subquery to do a calculation, or convert data for usage later on in the query.

trafficSource_medium sessions firstQuartile mean thirdQuartile
referral 7 7 20 20
organic 23 5 18 18
cpc 15 13 23 23
(none) 18 8 18 18

Now we’ll bucket/bin our data in preparation for a histogram.

trafficSource_medium sessions B0_4 B5_9 B10_14 B15_19 B20_24 B25_29 B30_
referral 7 1 1 0 1 1 1 2
organic 23 5 3 2 2 2 1 8
cpc 15 3 0 1 1 3 1 6
(none) 18 2 4 1 2 1 0 8

Dimensions

Some dimensions are simple to compute, while others require more ingenuity. Often these require subqueries, like we have seen earlier.

Extracting Dimensions from Other Fields

ga:pagePathLevel1 is an example of an easy-to-extract dimension. It is the first segment of the ga:pagePath (hits.page.pagePath in BigQuery), which we can pull out using SPLIT and NTH.

hits_page_pagePath pagePathLevel1 pagePathLevel2
/helmets/foldable.html helmets foldable.html
/ null null
/vests/ vests null
/vests/yellow.html vests yellow.html

However, we still need to bring along values we want to use later. For instance, if we wanted to compute the bounce rate per ga:pagePathLevel1, we should bring along totals.bounces.

pagePathLevel1 bounceRate
helmets 0.07
null 0.09
vests 0.03
login.html 0.0

Beyond simple transformations on fields, sometimes we want to create “arbitrary” groupings. In the case of Content Groupings that can be based on data already in the export (e.g. URL or Custom Dimension), we can tease it out via a CASE statement, among other ways.

hits_page_pagePath contentGroup1 contentGroup2
/helmets/foldable.html Products Foldable
/ RestOfSite null
/vests/ Products null
/vests/yellow.html Products Color
/vests/orange.html Products Color
/login.html RestOfSite null
/basket.html Checkout null

Slightly More Interesting Example

ga:landingPagePath is similarly easy to find using row-scoped aggregates; what that means is that for each row, we’re going to compute an aggregate based on a repeated value, in this case the hits.page.pagePath and hits.page.type. Since a session need not start with a pageview, we can’t simply grab all the hits with a hit.hitNumber equal to 1. To solve this problem, we’ll take the first hits.page.pagePath where hits.page.type is “PAGE”.

fullVisitorId visitId landingPagePath
380066991751227408 1378805776 /helmets/foldable.html
712553853382222331 1378804218 /vests/
881288060286722202 1378803865 /helmets/
881288060286722202 1378804975 /

More Complex Example

For a metric like ga:daysSinceLastSession that requires knowledge of more than one session, in this case, a user’s previous session, we can use windowing functions.

Windowing functions allow us to compute a value for the current row given the value of other rows in the “window” of data we’re looking at. In the following example, LAG returns the row before the row being looked at as defined by PARTITION BY fullvisitorid ORDER BY visitStartTime ASC. PARTITION BY acts much like a GROUP BY clause, in that it creates new windows for each unique value of the fields listed. ORDER BY sorts the window according to the fields and order listed.

We can read LAG(visitStartTime) OVER (PARTITION BY fullvisitorid ORDER BY visitStartTime ASC) as “looking at only the sessions for a given user, ordered by visitStartTime, what was the visitStartTime previous to the the row I’m currently on”.

fullvisitorid visitId daysSinceLastSession minutesSinceLastSession
3960256913998800485 1378805348 null null
4158255675143559999 1378818879 null null
4158255675143559999 1378819831 0 15.0

We are finding the visitStartTime of the previous session chronologically, and combining it with the visitStartTime of the current session to compute the days between the previous and current session. Above prevVisitStartTime will be NULL if there is no previous session, causing DATEDIFF return a NULL as well.

(The LondonCycleHelmet dataset is only of a single day, so we will only see NULL for the first session and 0 for the subsequent ones. That is why I also included minutesSinceLastSession to make the output a little more exciting!)

Even More Complicated Dimensions

Some dimensions, such as ga:channelGrouping and ga:hasSocialSourceReferral, require more computation that you can (or are willing to) write as a SQL statement. For this we leverage User Defined Functions, (UDFs) in BigQuery. UDFs are pieces of JavaScript that run in V8 on the same machine your data is on. For more information see my post, Self-Joins, Windowing, and User Defined Functions in BigQuery, or the BigQuery Documentation. UDFs take in a row of data, and return 0 or more rows.

In our case, we can use them to compute the Default Channel Grouping found in Google Analytics.

UDF:

SQL:

fullVisitorId visitId channelGroup
380066991751227408 1378805776 Referral
712553853382222331 1378804218 Organic
881288060286722202 1378803865 Organic
881288060286722202 1378804975 Organic
881288060286722202 1378805870 Organic
1677140157296205498 1378803386 Paid Search

Note that above doesn’t take into account social referrers, as these are not in the export. A previous post provides a list we can use to build up social refers to check against.

Content Groupings, if too complicated to do with a CASE and/or REGEXP_EXTRACT, can also be computed via UDFs.

Wrap Up

Hopefully this post helps demystify what is and isn’t contained in the BigQuery export and gives enough examples and information on to compute missing dimensions and metrics from the export. BigQuery is an extremely flexible tool that can free your analysis from the constraints of the API, even if everything the API provides isn’t immediately available.

Jim is a Data Architect at LunaMetrics. He is a rail fan with a MS in Civil Engineering and BS in Mathematics, both from the University of Pittsburgh, and has worked with software at local companies and the Pittsburgh Supercomputing Center. When not train-watching or extracting information from data, he works on software projects and is a member at a maker space and public transit advocacy group.

  • zhibo

    Hi, This article is really interesting and helped me a lot.
    Can I translate it into Chinese and publish it on a Chinese quora like site https://zhuanlan.zhihu.com/Google-analytics (www.zhihu.com) under the CC BY-NC-SA license?

  • Albert Malagarriga

    Thanks for this, I had been looking for an explanation for quite a while! I’m trying to set up queries like the ones ‘google spreadsheets analytics addon’ generates, but I’m having quite some trouble learning to query all the nested data and joining it with other tables on some of the values in our custom dimensions. Do you have any articles giving some protips on that?

    • http://lunametrics.com/ James Keener

      When JOINing on a repeated field, you’ll need to FLATTEN the table explicitly, e.g. FLATTEN(table, customDimensions). These can be nested, if say you wanted to join on a session and hit-level custom dimension, e.g. FLATTEN(FLATTEN(table, customDimensions), hits.customDimensions). BE WARNED! This will greatly increase the amount of data that will be processed by a query like this.

      • Albert Malagarriga

        Yeah, thats how we have been using FLATTEN until now, but its too slow to collect the information we want from users doing it like that… Do you guys have any interestin article to read more about FLATTEN? I’ve been looking for more stuff on it. Thanks a lot!

  • Mariana Alves

    Hi! Thanks for this! I´m trying use regex in UDF but my querie failed. The error is: Cannot read property ‘match’ of undefined at line 9, columns 28-29

    The line 9 is:
    && !row.Campaign.match(/pqbdg|BRD_Netshoes|Pesquisa_Netshoes|Google_cpc_conteudo+/g)

    I cannot use match with regex? Could you help me?

    Thanks

    • http://lunametrics.com/ James Keener

      The problem isn’t the regex, it’s that Campaign isn’t being defined. Is Campaign in your input parameters list and is it being set in the input query correctly?

      • Mariana Alves

        Hi James, thank you for the reply but I´m defining Campaign

        bigquery.defineFunction(
        ‘defaultChannelGroup’, // Name of the function exported to SQL
        [‘Medium’, ‘Source’, ‘fullVisitorId’, ‘visitId’,’adNetworkType’,’Campaign’,’SourceMedium’], // Names of input columns

        Can I use match to regex?

        Thx

        • http://lunametrics.com/ James Keener

          Another issue you may be encountering is that trafficSource.campaign isn’t always set, and may be NULL, which will be undefined in the UDF/JavaScript. In the UDF you could do something like var specific_campaign = false; if (row.Campaign) { specific_campaign = row.Campaign.match(/pqbdg|BRD_Netshoes|Pesquisa_Netshoes|Google_cpc_conteudo+/g) != null; } (the != null to turn it into a boolean).

          • Mariana Alves

            Thanks James! It´s works to me! 🙂

  • Bob

    Thanks for the tutorials, but I’m confused to how to check which product the customer clicked. Using hits.type=’page’ and hits.eventInfo.eventCategory=’Product Details Page’ seems not right

  • Jesper

    In my opinion, BigQuery is missing a free or at least cheaper access to raw data at hit level.

    Thus I would suggest to try out https://scitylana.com. Scitylana will pull all raw data unsampled, user by user, click by click from GA (free edition) onto your local hard drive. From here on you can use data in Excel, Power Pivot, Power BI, SQL Server or whatever you like.

Contact Us.

LunaMetrics

24 S. 18th Street, Suite 100,
Pittsburgh, PA 15203

Follow Us

1.877.220.LUNA

1.412.381.5500

getinfo@lunametrics.com

Questions?
We'll get back to you
in ONE business day.