Replicating the Google Analytics All Pages Report in BigQuery

/


While BigQuery is often the perfect tool for doing data science and machine learning with your Google Analytics data, it can sometimes be frustrating to query basic web analytics metrics. In this post, I’ll walk through calculating some fundamental metrics at the page level by replicating the All Pages report for the Google Merchandise Store in BigQuery.

I will be using Google’s sample dataset for the Google Merchandise store for August 1, 2016. You can compare this data to data in Google Analytics by viewing the Google Analytics demo account. I will be using the standard SQL dialect.

Word of warning: Although this post discusses the All Pages report – probably the most popular report in Google Analytics – it is not for the faint of heart. We will discuss a lot of technical details surrounding Google Analytics definitions, the SQL language, and the format of the BigQuery data. If you have not worked in BigQuery before, you may want to start off with a gentler introduction.

Primary Dimension – Page

In BigQuery the page dimension, or URL, is stored in the field hits.page.pagePath. Recall that Google Analytics data is stored at the session level. To access hit level information, we will need to unnest our table by hits. The query below shows all URLs that were visited by users.

Pageviews

Next, we want to add in the number of pageviews associated with each URL. If you look at the Google Analytics schema, the only pageview related field you will see is totals.pageviews. Use extreme caution when using this field! It is a session level field (it’s counting the total number of pageviews in the entire session) and it should only be mixed with hit level fields with extreme care. This metric is not useful for replicating the All Pages report. Instead, we will need to calculate this metric ourselves.

So, what is a pageview and how do we calculate it? We often think of pageviews as the number of times that a page was loaded in the browser. In Google Analytics, this is measured by the number of times we send a “pageview” hit to Google Analytics. (Recall there are other types of hits that can be sent to Google Analytics, such as event, social, timing, and transaction hits.)

To calculate pageviews in Google Analytics, we need to count the number of times a hit of type PAGE is associated with each URL. As above, we will use the field hits.page.pagePath to identify the URL. Then filter the results by setting the hits.type field equal to ‘PAGE’ and count up the hits.

Unique Pageviews

In Google Analytics, unique pageviews represent the number of sessions during which a page was viewed. Once again, we want to focus on the hits.type equals ‘PAGE’, but rather than count the hits, now we want a count distinct on the number of sessions.

To count unique sessions, we need a unique way of identifying a session. The visitId field looks promising for this. However, this identifier has some issues when a session ends a midnight, so I recommend using visitStartTime instead. Note that the visitStartTime will only be unique per user, so we need to concatenate the visitStartTime with the fullVisitorId to get a globally unique session ID.

Using this session identifier, we can now compute unique pageviews.

Average Time on Page

The formula for average time on page is: Total Time on Page / (Pageviews - Exits).

We will need to compute each of the components of this formula separately, then compute the average using the formula above.

Pageviews

Easy – we have already calculated this above.

Exits

Fortunately, exit pages are designated with the hits.isExit field. We just need to count these up to get total exits on the page.

Total Time on Page

This one is tricky. Recall that Google Analytics calculates the time on page by comparing the timestamps of hits sent in on different pages. For non-exit pageviews, the time on page is computed by subtracting the timestamp of the next pageview hit from the timestamp of the current pageview hit. For any pageview that is an exit, we compute the time on page as the difference between the timestamp of the last interaction hit minus the timestamp of the pageview hit.

If a page is an exit page and it’s the only page of a session, and there are no interaction events, then this page is considered a bounce and does not contribute to total time on page. Put more simply, bounces do not affect the avg time on page metric.

The timestamp information for each hit can be found in the hits.time field. Note that this field is measured in milliseconds, so you will need to divide the final time by 1,000 to get back to seconds. For each pageviews, we need to compute two additional columns – the timestamp for the next page as well as the timestamp for the last interaction hit.

We can compute the timestamp for the next pageview (within that session) by using a lead window function.

We can create a new column with the timestamp for the last interaction hit by finding the maximum timestamp out of all of the interaction events. Interaction hits are designated by the field hits.isInteraction.

Be careful when combining these two queries. Filtering in the where clause is a little tricky because we need to include events to get a last interaction hit value. Therefore, we need to be a little patient before calculating the next_pageview timestamp.

Now we can compute time on page by comparing these two columns to the hits.time of the pageview.

Put It All Together

Finally, we are ready to aggregate and compute the average time on page.

Entrances

Fortunately, entrances by page are easy to compute. There is a field called hits.isEntrance that we can use to determine whether that pageview is an entrance.

Bounce Rate

Bounce rate is another complicated field in BigQuery. The formula for bounce rate is: Bounces / Sessions. We will need to compute bounces and sessions separately.

Bounces

Bounces are attributed to the first interaction hit in a session in which there is exactly one interaction event. We can determine if there was exactly one interaction event in the session by using the totals.bounces field. Now we just need to find the first interaction hit in the session.

The hits.hitNumber field will work well for this task. We will just need to use a window function to identify the hit number for the first interaction hit in the session. (Note that we can determine if a hit is an interaction hit by using the hits.isInteraction field.)

Sessions

Sessions are attributed to the first hit (interaction or not) in a session where there is at least one interaction event. (Keep in mind that sessions are closely related to but not the same as entrances). Fortunately, we can use the totals.sessions field to identify whether there is at least one interaction event. Now we just need to identify the first hit in the session.

To find the first hit in the session, it is very tempting to just check if hits.hitNumber equals 1. Unfortunately, like visitId, this field does not restart at midnight. So, we will need to use a window function to identify the first hit number in the session. We can then use this column to compute the number of sessions associated with each page.

Aggregate and Combine

Now we just need to aggregate bounces and sessions then divide.

% Exit

The formula for % Exit is: Exits / Pageviews.

We have already calculated pageviews and exits above. Now we just need to combine these into a single query.

Page Value

Calculating page value is beyond the scope of this post. In general, I would recommend using the Google Analytics Core Reporting API to pull the page value or create your own custom content scoring method.


Accessing standard Google Analytics metrics through BigQuery can be more painful and time-consuming than using the Core Reporting API. However, BigQuery can give you the power to add more advanced reporting and analysis to the standard Google Analytics reports, and adding in those standard metrics can help provide context to your analysis.

Becky is a Data Scientist at LunaMetrics. She started deriving equations and building calculators in high school and ended up with a Masters in math from Georgia Tech. Her experience in data analysis and reporting has given her a great appreciation for data-driven decision making. Becky enjoys swimming, working on puzzles, and spending time with her husband Jonathan.

  • Jean-François Verville

    Hi Becky,

    Super interesting. When you compared BQ results to the All Pages GA report, were you perfectly matching on all metrics? I have gaps for some metrics even if I did exactly as you did. Wanted to set my expectations at the right place. Thanks.

    • Jean-François Verville
      • Jakub Otrząsek

        Hi all,
        Please keep in mind timezones and double check that you are using a view which is linked with GBQ.
        Linked view should be “RAW data” – which means no alterations.
        When for most of the time we use “primary” / “master” etc kind of view with exclude/include filters (and sometimes with a different timezone settings).
        It is easy to miss this.
        Regards
        Jakub

  • Aleksandr Osiyuk

    Thanks for the good article! I will publish an article in my Telegram channel: https://t.me/BigQuery
    There I talk about interesting features of BigQuery.

    • Demasturbot

      No shit! Exactly where I came from!

  • Peter Blakemore

    Why do I get a FROM FROM error when I copy and paste your code into the online BGQ editor?

  • Ben Zitney

    I’ve never analyzed GA data for BigQuery, so my question here is why would you want to do this? Is this just an exercise showing how to replicate certain reports/metrics in GA? Or is there a specific reason why you would want to replicate this specific report? What could you do with it in BigQuery that you couldn’t in the GA UI?

Contact Us.

Follow Us

1.877.220.LUNA

1.412.381.5500

getinfo@lunametrics.com

Questions?
We'll get back to you
in ONE business day.
Our Locations
THE FOUNDRY [map] LunaMetrics

24 S. 18th Street
Suite 100

Pittsburgh, PA 15203

THE STUDIO [map]

4115 N. Ravenswood
Suite 101
Chicago, IL 60613

THE LODGE [map]

2100 Manchester Rd.
Building C, Suite 1750
Wheaton, IL 60187