Case Study: Audience Modeling with Google Analytics 360 and Google Cloud Platform


Google recently published a case study with LunaMetrics and PBS where we used Google Analytics 360, BigQuery, and Google Cloud Datalab to perform a clustering analysis on website audiences. The case study is worth checking out, but here’s a quick summary.

PBS.org is a digital content hub with supporting information and streaming video for PBS television programming. PBS has long known that, in the words of Amy Sample, Sr. Director of Strategic Insights, “there is no ‘average’ user of PBS.org – rather, there are categories of users who do different things.” However, the size of PBS’s data, over 330 million sessions in a year, makes it difficult to recognize patterns of behavior in web analytics data by inspection alone. So PBS worked with LunaMetrics on a Data Science Solutions project to apply data mining techniques to segment audiences.

This type of analysis opens the door to audience-based personalization that improves the onsite experience, as well as advanced reporting and advertising opportunities. You can read the full case study for more information. In this blog post, though, I want to take a few minutes to focus on the process and tools involved in crunching this data.

Google Cloud Platform

We leveraged a number of Google Cloud Platform tools to complete this analysis.

The process started with Google BigQuery, since Google Analytics 360 offers an automatic export of data into this cloud database tool. BigQuery is a powerful and fast way to query the large dataset of web behavior data (on the order of terabytes) and to distill and categorize it into an input dataset for data mining. Here on the Luna blog, we’ve written lots about the uses for GA data in BigQuery and how to get started using BigQuery.

Extracting the Right Data

Figure (Querying with BigQuery): The first step was extracting the relevant data from Google Analytics data in BigQuery.
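To make the extraction step a little more concrete, here is a minimal sketch of the kind of query that rolls sessions up into one row per visitor. It only builds the SQL text, so it runs without a Cloud account. The field names (fullVisitorId, totals.pageviews, totals.timeOnSite) and the ga_sessions_ daily tables come from the standard Google Analytics 360 export schema, but the project name, dataset name, date range, and choice of features are all placeholders for illustration, not the ones used in the PBS project.

```python
from datetime import date


def build_visitor_features_query(project: str, dataset: str,
                                 start: date, end: date) -> str:
    """Return BigQuery standard SQL that aggregates sessions per visitor.

    The project and dataset names are supplied by the caller; the features
    chosen here are illustrative only.
    """
    # The GA 360 export writes one table per day: ga_sessions_YYYYMMDD.
    table = f"`{project}.{dataset}.ga_sessions_*`"
    return f"""
SELECT
  fullVisitorId,
  COUNT(*) AS sessions,
  SUM(IFNULL(totals.pageviews, 0)) AS pageviews,
  SUM(IFNULL(totals.timeOnSite, 0)) AS seconds_on_site
FROM {table}
WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'
GROUP BY fullVisitorId
""".strip()


# Hypothetical project and dataset names.
query = build_visitor_features_query(
    "my-project", "my_ga_dataset", date(2016, 1, 1), date(2016, 12, 31)
)
print(query)
```

The resulting string can be pasted into the BigQuery console or passed to whichever client library you prefer. Restricting _TABLE_SUFFIX keeps the query from scanning more daily tables than it needs.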

Even distilled down to gigabytes rather than terabytes, this was too large a dataset to analyze on a typical laptop computer. Dedicated computing hardware for a special project like this doesn’t really make sense, but Cloud Platform has our back with Compute Engine, which gives us the ability to spin up a powerful computational machine just for when we need it. Specifically, Cloud Datalab gave us an environment to run on Compute Engine with just the data analysis tools we need (see below for more details). We also used Cloud Storage to move data between these services.

Doing all this in the cloud, from data storage to querying to computation, means we never had to download the data locally. And because these are all managed services, Cloud Platform takes care of the configuration, admin, updates, and all that IT stuff that gets in the way of getting the actual work done.

Cloud Datalab: Python Tools for Big Data Analysis

Cloud Datalab is a beta member of the Cloud Platform family. It allows an easy, one-click deployment of a tool called Jupyter to a Compute Engine instance. Jupyter is a tool and a format for running Python code in a “notebook”, where you can combine documentation, commentary, data, and code all in one place. You can think of it as a lab notebook with built-in computation power, and it’s a pretty commonly used tool in Python. (It used to be called IPython, and you may know it under that name.)

Figure (Jupyter Notebook): Jupyter notebooks provide an easy way to analyze and document data with Python.

When you’re doing data science work, there are two main scripting languages you might turn to. One choice is R, which is designed specifically for statistical computation. We’ve written lots about R and Google Analytics in the past on this blog.

The other choice is Python, which is a general-purpose scripting language that now offers a number of libraries for data science work. If you’re already familiar with Python (and with Jupyter notebooks), this is an easy way to get started. The following libraries will get you off on the right foot:

  • Pandas – reading, organizing, and filtering data, handling missing values, etc.
  • NumPy – dealing with arrays and matrices
  • SciKit-Learn – data mining and analysis tools

There are also a number of libraries for creating charts and visualizations. Pandas and NumPy handle a good deal of the underlying work around storing, reading, and manipulating data, but SciKit-Learn is where the magic happens. It includes modules for techniques such as regression, classification (image recognition, for example), and clustering.
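As a rough illustration of the preparation work that Pandas and NumPy take care of, the sketch below fills in missing values and puts a few behavioral counts on a comparable scale. The column names and numbers are invented for the example, and treating a missing count as zero is an assumption that may not suit every dataset.

```python
import numpy as np
import pandas as pd

# Made-up per-visitor rows standing in for a query result.
visitors = pd.DataFrame({
    "fullVisitorId": ["a", "b", "c", "d"],
    "sessions": [1, 12, 3, 40],
    "pageviews": [2.0, 85.0, np.nan, 310.0],
    "video_starts": [0.0, np.nan, 1.0, 25.0],
}).set_index("fullVisitorId")

# Assumption: a missing count means the visitor did none of that activity.
filled = visitors.fillna(0)

# Web behavior counts tend to be heavily skewed, so compress them
# with log(1 + x) before comparing visitors.
logged = np.log1p(filled)

# Standardize each column to mean 0 and standard deviation 1 so that
# no single feature dominates a distance calculation.
scaled = (logged - logged.mean()) / logged.std(ddof=0)
print(scaled.round(2))
```

Scaling matters for the clustering step because most clustering algorithms measure distance between points, and a feature with a large range would otherwise drown out the rest.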

Clustering is the type of problem we were tackling with the PBS data: given user behavior, we wanted to group users into audiences who behaved similarly. A clustering algorithm does exactly that – essentially taking a set of data points and grouping them into “clusters” of points that are close together. SciKit-Learn offers several algorithms for clustering. The outputs of this clustering let us describe a number of distinct audiences for PBS, which can be further used to segment their data, tailor marketing, or personalize content.
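Here is a small sketch of what a clustering run can look like in SciKit-Learn. It uses k-means on synthetic data purely as an example. This post does not say which algorithm or features were used for PBS, so the algorithm, the two features, and the number of clusters here are all assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic visitors with two made-up features: [sessions, video_starts].
# One group browses lightly, the other streams a lot of video.
light = rng.normal(loc=[2, 0.5], scale=0.5, size=(50, 2))
heavy = rng.normal(loc=[20, 15], scale=2.0, size=(50, 2))
X = np.vstack([light, heavy])

# Put both features on the same scale before measuring distances.
X_scaled = StandardScaler().fit_transform(X)

# Ask k-means for two clusters; in practice the right number is not
# known in advance and has to be explored.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

# Describe each cluster by its size and its average behavior,
# reported in the original (unscaled) units.
for k in range(2):
    members = X[labels == k]
    print(k, len(members), members.mean(axis=0).round(1))
```

Summarizing each cluster by its average behavior is what turns a column of labels into a describable audience, such as "occasional browsers" versus "frequent video viewers".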

Figure (Clustering): Data mining algorithms clustered user behavior by similarity across multiple factors.

Conclusions

Google Cloud Platform provides an end-to-end environment for processing and analyzing data from Google Analytics 360. And with freely available Python tools like Jupyter and SciKit-Learn, you can play around with small datasets on your laptop, or crunch giant ones in the cloud, in the same formats and with the same tools. Install Jupyter today and you can get started. Or, if you have a Google Analytics 360 dataset and don’t know where to get started, our team at LunaMetrics can help.

Jonathan Weber is our Data Evangelist, focusing on bringing the strategic value of data analysis to our customers. He spreads the principles of analytics through our training seminars and is currently writing a book on Google Analytics & Tag Manager. Before he caught the analytics bug, he worked in information architecture. Away from the computer you can find him as a flower farmer and plant geek.

  • conradlee@gmail.com

    Hi, nice post, thanks for sharing. I’m curious: what clustering techniques did you find to be most useful? How many techniques and how much tweaking did you need to find clusters that were useful?

    Also, did you essentially use Cloud DataLab as a way of provisioning a single machine with lots of memory that you could use for scikit learn? In other words, did you use it as a distributed (cluster) computing tool, or did you perform your scikit-learn analysis on a single large machine?
