Compare Actual vs Predicted Data with Google Analytics and CausalImpact

Google Analytics (GA) is a great tool for collecting and analyzing web site data, but it’s not without its limitations. If you need to run more advanced analyses, you probably find yourself pulling the data out of GA and putting it into other complementary tools, such as R, Tableau, Shufflepoint (and more!). Today’s focus will be on CausalImpact, an open-source package in R developed by the data gurus themselves (Google).

CausalImpact Overview

CausalImpact allows you to measure how different ad campaigns, such as paid search, affect ROI outcomes, such as sales. You may be thinking, But Google Analytics already has that data. And GA does have that data! However, it doesn’t account for something very peculiar…

Peculiarities in Paid Advertising

Let’s say that, on average, you get 100 sessions/day from organic search traffic without any advertising efforts. Then, your team decides that it would be a good idea to run a paid search advertising campaign for a new product or service that was just launched. After beginning the campaign, your sessions increased, on average, from 100 sessions/day to 200 sessions/day. Your team is super excited! Why wouldn’t they be? Your average sessions/day doubled!

Now let’s look a little closer… In Google Analytics, you can see the number of sessions/day that came from your paid advertising efforts. You notice that it says 125 sessions came from your ad. Wait a minute! That leaves only 75 organic sessions (200 combined organic and paid sessions – 125 paid sessions), so it appears as though 25 of your usual 100 organic sessions were “cannibalized” by your ad!

In other words, based on the history of your site, you could’ve expected to get approximately 100 sessions/day without advertising. Therefore, although it may seem that you got 125 sessions from your ad, only about 100 of them were additional; you paid for roughly 25 users who would’ve come to your site regardless.
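The arithmetic in this example can be written out as a quick check in R. The variable names are made up for illustration; the numbers are the ones from the example.

```r
# Numbers from the example (sessions per day)
baseline_organic <- 100   # expected organic sessions without any ad
total_with_ad    <- 200   # organic + paid sessions while the ad runs
paid_reported    <- 125   # sessions GA attributes to the ad

# Organic sessions that remain while the ad is running
organic_with_ad <- total_with_ad - paid_reported

# Organic sessions "cannibalized" by the ad
cannibalized <- baseline_organic - organic_with_ad

# Sessions the ad added on top of the baseline
incremental <- total_with_ad - baseline_organic

cat("Cannibalized:", cannibalized, "| Incremental:", incremental, "\n")
```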

So how can you see the number of additional sessions that actually came from your ad, especially since you don’t know what would have happened to your traffic if you didn’t run the ad? Glad you asked!

CausalImpact Saves the Day!

CausalImpact looks at the net difference between the value of a certain metric, say sessions, that you would have received without the advertising campaign (model predicted value) and the value of the same metric that actually occurred (data in GA).

CausalImpact is majestic in that it uses causal inference, which looks to see if something is likely to be the cause of something else, to predict the value of your desired metric. In this case, it fits a Bayesian structural time-series model to the data from before the intervention, predicts what would have happened without the intervention, and compares that prediction with what actually happened.

It is important to note that CausalImpact does make the following three assumptions:

1. The control is not affected by the marketing intervention.
2. There’s a stable relationship between the control and the affected segment from pre.period through post.period.
3. You have an understanding of the spike and slab prior in the time-series model.

How Does it Work?

Any number of extraneous conditions (anything besides your advertising efforts) could cause an increase or decrease in your web site traffic. So as per my previous example, you may indeed receive 100 sessions/day, on average, but what if your traffic increased to 200 sessions for a few days due to an extraneous condition (such as bad weather leading more users to be online)? I’m sure you can see how this could skew your results!

In order to get around this, the time-series model allows you to use a control for more accurate data. One of the most popular ways to get this control is to create it! Thus, it is deemed your “synthetic control”.

Consider the following… Looking at the entire month of March, you ran an ad that targeted users in Pittsburgh from March 15 through March 31; however, New York City was a similar metro that didn’t have an ad running at any point during the month of March. As a result, your control would be a segment of traffic from New York City from March 1 through March 31, and your actual data would be a segment of traffic from Pittsburgh from March 1 through March 31.

Essentially, you need to compare similar segments in order to yield accurate predictions: One segment with the marketing intervention and one segment without.
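To make the two-segment idea concrete, here is a toy simulation in base R. Everything in it is an assumption for illustration: the year (2016), the session levels, the size of the lift, and the column names. Real data would come from Google Analytics instead.

```r
set.seed(42)  # reproducible toy data

dates <- seq(as.Date("2016-03-01"), as.Date("2016-03-31"), by = "day")
n <- length(dates)

# Control segment: New York City, no ad at any point in March
nyc <- round(100 + cumsum(rnorm(n, sd = 2)))

# Affected segment: Pittsburgh tracks NYC, plus a made-up lift of 25
# sessions/day once the ad starts on March 15
ad_running <- dates >= as.Date("2016-03-15")
pittsburgh <- round(1.1 * nyc + rnorm(n, sd = 2) + ifelse(ad_running, 25, 0))

segments <- data.frame(date = dates, pittsburgh = pittsburgh, nyc = nyc)
head(segments)
```

The control only helps if it moves with the affected segment before the ad starts, which is why Pittsburgh is simulated here as a scaled, noisy copy of New York City.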

A Step-By-Step, How-To Guide

Now you may be thinking, This sounds great and all, but it seems too complicated. Alas, have no fear! The explanation is here.

1. The basics.

You’re going to need to know how to pull the data you need out of GA and put it into R. Becky West, a fellow Lunametrician, wrote an easy-to-understand, step-by-step process on how to do so here. Pay attention to the following in her blog:

  • Downloading R/RStudio
  • Installing RGA/Authenticating GA
  • Choosing your view in GA (and storing it in a variable, such as viewId)
  • Exporting via Graphic into R

2. Getting things ready.

Once you have a basic understanding of those concepts, you’re ready to move along! After opening RStudio, do the following:
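A minimal setup sketch, assuming the two packages this walkthrough relies on are RGA (for the Google Analytics API) and CausalImpact. How you install them (CRAN or GitHub) depends on your setup, so this only checks for them, loads them, and starts authentication when run interactively.

```r
# Packages this walkthrough assumes
needed <- c("RGA", "CausalImpact")

# Check which of them are available in this R installation
is_installed  <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!is_installed]

if (length(not_installed) > 0) {
  message("Install these packages before continuing: ",
          paste(not_installed, collapse = ", "))
} else {
  library(RGA)
  library(CausalImpact)
  # Opens a browser window to grant access to your GA account,
  # so it only makes sense in an interactive session.
  if (interactive()) authorize()
}
```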

3. Pull the data.

If you have trouble, feel free to reference Becky’s post. For this blog example, we’ll use the following code:
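A minimal sketch of such a pull, assuming the RGA package's get_ga() function. The helpers segment_expr() and pull_sessions(), the 2016 date range, and the placeholder segment IDs are made up for illustration; viewId is the variable that holds your view ID.

```r
# Build the segment expression in the gaid::segmentID format
segment_expr <- function(segment_id) paste0("gaid::", segment_id)

# Hypothetical helper: daily sessions for one segment of one view.
# Requires the RGA package and a prior call to RGA::authorize().
pull_sessions <- function(view_id, segment_id,
                          start = "2016-03-01", end = "2016-03-31") {
  RGA::get_ga(
    profileId  = view_id,
    start.date = start,
    end.date   = end,
    metrics    = "ga:sessions",
    dimensions = "ga:date",
    segment    = segment_expr(segment_id)
  )
}

# Not run here: needs real credentials, a view ID, and your own segment IDs.
# gaDataIntervention    <- pull_sessions(viewId, "YOUR_INTERVENTION_SEGMENT_ID")
# gaDataNonIntervention <- pull_sessions(viewId, "YOUR_CONTROL_SEGMENT_ID")
```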

Please note that you need to pull your segment of data by its ID, which is admittedly tricky to find! One way to find your segment ID is to apply the segment to the All Pages report, and then look towards the end of the page URL. Everything after the word “user” is your segment ID, and should be added into the following format: gaid::segmentID.

4. Cleaning the data.

In R terms, you now have two lists stored in two different variables (gaDataIntervention and gaDataNonIntervention). If you don’t believe me, you can type in the following code to see what I mean:
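With a real pull you would inspect gaDataIntervention directly. The sketch below uses a stand-in data frame so it runs without GA access; its shape (a date column and a sessions column) is an assumption, and real output may name or type the columns differently.

```r
# Stand-in for a real pull: one row per day with a date and a session count
set.seed(1)
gaDataIntervention <- data.frame(
  date     = seq(as.Date("2016-03-01"), as.Date("2016-03-31"), by = "day"),
  sessions = rpois(31, lambda = 150)
)

class(gaDataIntervention)    # "data.frame"
is.list(gaDataIntervention)  # TRUE: a data frame is a list of columns
str(gaDataIntervention)      # column names and types
```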

When you run the CausalImpact() function, you can only have one set of data. As a result, we’ll combine the two sets into one:
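One way to combine the two pulls, sketched with stand-in data so it runs on its own. The column names date and sessions, and the name workingData, are assumptions for illustration. CausalImpact treats the first column as the response and the remaining columns as controls, so the affected segment goes first.

```r
set.seed(1)
dates <- seq(as.Date("2016-03-01"), as.Date("2016-03-31"), by = "day")

# Stand-ins for the two GA pulls
gaDataIntervention    <- data.frame(date = dates, sessions = rpois(31, 150))
gaDataNonIntervention <- data.frame(date = dates, sessions = rpois(31, 120))

# Join on date so each row holds the same day for both segments
combinedData <- merge(gaDataIntervention, gaDataNonIntervention,
                      by = "date", suffixes = c(".intervention", ".control"))

# Response first, control second; the date column is dropped
workingData <- combinedData[, c("sessions.intervention", "sessions.control")]
head(workingData)
```

Joining on date, rather than pasting columns side by side, guards against the two pulls coming back with rows in a different order or with a day missing.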

5. Defining the variables.

Define your pre.period and post.period!

Reminder, pre.period = date range in pulled data before intervention, and post.period = date range in pulled data after intervention. Use the following as an example:
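For the March example (ad running from March 15 through March 31), the periods can be given as row indices into the 31 daily rows. This sketch assumes one row per day, in date order.

```r
# Row indices into the 31 daily rows pulled for March
pre.period  <- c(1, 14)    # March 1 through March 14: before the ad
post.period <- c(15, 31)   # March 15 through March 31: ad running

# Sanity checks: the periods must not overlap and should end with the data
stopifnot(pre.period[2] < post.period[1])
stopifnot(post.period[2] == 31)
```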

6. Run the package!

Warning – Nothing will show up on your screen just yet.
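A self-contained sketch of this step, using simulated stand-in data instead of real GA pulls (the session levels and the lift of 25 are made up). It stores the result in a variable named impact, which is why nothing is printed. The model fit is skipped if the CausalImpact package is not installed.

```r
set.seed(1)

# Stand-in data: a control segment, and an affected segment that tracks it
# and gains roughly 25 sessions/day from row 15 onward
control  <- 100 + cumsum(rnorm(31, sd = 2))
affected <- 1.1 * control + rnorm(31, sd = 2) + c(rep(0, 14), rep(25, 17))
workingData <- data.frame(affected = affected, control = control)

pre.period  <- c(1, 14)
post.period <- c(15, 31)

if (requireNamespace("CausalImpact", quietly = TRUE)) {
  # Fits the model on the pre-period and predicts the post-period
  impact <- CausalImpact::CausalImpact(workingData, pre.period, post.period)
} else {
  message("CausalImpact is not installed; skipping the model fit.")
}
```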

Interpreting Your Results

CausalImpact makes interpreting your results relatively easy! As you found that nothing showed up when you ran the CausalImpact function, you can use the following code to actually see your results:
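A sketch of plotting the results, again on simulated stand-in data (made-up numbers) so that it runs without GA access. With your own data you would simply call plot() on the object returned by CausalImpact().

```r
set.seed(1)

# Stand-in data: control segment and an affected segment with a lift
# of roughly 25 sessions/day from row 15 onward
control  <- 100 + cumsum(rnorm(31, sd = 2))
affected <- 1.1 * control + rnorm(31, sd = 2) + c(rep(0, 14), rep(25, 17))
workingData <- data.frame(affected = affected, control = control)

if (requireNamespace("CausalImpact", quietly = TRUE)) {
  impact <- CausalImpact::CausalImpact(workingData, c(1, 14), c(15, 31))
  # Draws the three panels: original, pointwise, and cumulative
  print(plot(impact))
}
```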

[Figure: CausalImpact output with three stacked graphs: original, pointwise, and cumulative]
Isn’t it B-E-A-U-T-I-F-U-L?! For the purposes of keeping Google Analytics data confidential, we used the same simulated data that Google used.

Original (First) Graph:

Solid, Black Line: Observed data, both before and after the intervention
Dotted, Blue Line: Model predicted values for what would have occurred without the intervention

Pointwise (Second) Graph:

The net difference between the observed and predicted response on the original scale, or the difference between the solid, black line and the dotted, blue line on the original graph.

Cumulative (Third) Graph:

Dotted, Blue Line: Individual causal effects added up in time, day after day.

For all three graphs, the light blue shaded area represents a 95% interval (a Bayesian credible interval) for the plotted values. The farther the graph extends past the beginning of the intervention, the less certain the causal effect becomes; hence, the larger the shaded area.
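The relationship between the three graphs comes down to a subtraction and a running sum. The sketch below uses five made-up post-intervention days; the numbers are purely illustrative and are not output from CausalImpact.

```r
# Illustrative numbers only: five post-intervention days
observed  <- c(210, 195, 205, 220, 200)   # what actually happened
predicted <- c(180, 182, 179, 185, 181)   # the model's "no ad" prediction

pointwise  <- observed - predicted   # second graph: daily effect
cumulative <- cumsum(pointwise)      # third graph: running total of the effect

data.frame(day = 1:5, observed, predicted, pointwise, cumulative)
```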

Statistics in the Report

What’s a graph without some Bayesian statistics? To see these statistics behind the graph, just enter the following:
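A sketch of printing the summary table, using the same kind of simulated stand-in data (made-up numbers). With your own data, the call is summary() on the object returned by CausalImpact().

```r
set.seed(1)

# Stand-in data: control segment and an affected segment with a lift
# of roughly 25 sessions/day from row 15 onward
control  <- 100 + cumsum(rnorm(31, sd = 2))
affected <- 1.1 * control + rnorm(31, sd = 2) + c(rep(0, 14), rep(25, 17))
workingData <- data.frame(affected = affected, control = control)

if (requireNamespace("CausalImpact", quietly = TRUE)) {
  impact <- CausalImpact::CausalImpact(workingData, c(1, 14), c(15, 31))
  # Prints actual vs. predicted values and the estimated effect,
  # as averages and as cumulative totals over the post-period
  summary(impact)
}
```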


Furthermore, the summary function coupled with the report argument describes the summary results in a report format! Pay special attention to the last paragraph. Recall the concern with extraneous reasons for differences in traffic? This paragraph lets you know the calculated probability that the effect arose by chance rather than from your intervention. If you need help understanding these statistics, try the following:
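A sketch of the report form of the summary, on simulated stand-in data (made-up numbers). The only change from the plain summary is the second argument, "report".

```r
set.seed(1)

# Stand-in data: control segment and an affected segment with a lift
# of roughly 25 sessions/day from row 15 onward
control  <- 100 + cumsum(rnorm(31, sd = 2))
affected <- 1.1 * control + rnorm(31, sd = 2) + c(rep(0, 14), rep(25, 17))
workingData <- data.frame(affected = affected, control = control)

if (requireNamespace("CausalImpact", quietly = TRUE)) {
  impact <- CausalImpact::CausalImpact(workingData, c(1, 14), c(15, 31))
  # Prints a written, paragraph-style interpretation of the results
  summary(impact, "report")
}
```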


Moving forward, CausalImpact can help you better understand the peculiarities in paid advertising by determining how much additional traffic you actually receive from running your campaign. With this information, you can see a strong prediction of how many organic sessions were cannibalized by your advertising efforts. Therefore, you can refine your advertising campaign strategies accordingly. Taking it a step further, you can use these strategies to compare the data you received with the expected data predicted by models with synthetic controls.

Kaelin Harmon is a Junior Analyst at LunaMetrics. She is pursuing her bachelor's and master's degrees in Data Analytics at Robert Morris University. She's all about bridging the gap between businesses and customers by turning data into actionable insights. If you see someone in a Storm Trooper onesie enjoying delicious grub at a restaurant or exploring downtown Pittsburgh, it's probably her. She's loving her new freedom from the First Order.

  • John Peterson

    Thanks for this article and for showing a way to get predictive analytics in GA. I love the technicalities in the article and will implement it on my site. I am sharing this on my SEO FB group too; I am sure they will love it. Google Analytics and Gostats are 2 of my favorite analytics tools ever.

  • Hoang Duc Anh

    Love your article! Great post! One of the most difficult parts of marketing is measuring the effect of a marketing campaign! Google Analytics can do many things. However, to go deep into the data, GA needs other tools to support it, and R is one of its best friends!
    Once again, great article! Keep posting!

    • Kaelin Harmon

      Hi, Hoang. Agreed! Thank you very much.

  • Yehuda

    This is a great article! Just to clarify… if I am interested in isolating AdWords campaigns for branded keywords to see if they cannibalize organic, would I create a GA segment that combines organic sessions and AdWords (branded campaigns only)?

  • Yehuda

    A follow-up question: when I view the data by date in RStudio, the Intervention and Nonintervention data are the same (the same as if you export the data directly from GA); however, the totals (in this case sessions) are different (due to the segment). I am confused because the model is time-series based, yet the by-date data pulled via the API for two different segments is identical; only the totals are different. Is this accurate?

    • Kaelin Harmon

      Hey Yehuda! Thank you.

      For your first question, yes, you would need one GA segment not affected by your branded keywords (only organic) and one GA segment affected by your branded keywords (organic and branded keyword).

      If I am understanding your second question correctly, the number of sessions for your unaffected GA segment would most likely be different from the number of sessions for your affected GA segment, even though both segments are taken from the exact same time frame.

      I hope this helps!

      • Yehuda

        Hey, thank you… 3rd and final question 🙂 When I ran a few scenarios, the results were +2% with p below .05. How can I translate that into actual numbers? Does that mean the affected did cannibalize, but the net effect was positive? Or does that mean the affected did not cannibalize the control period? If the former, is it possible to arrive at the actual value of cannibalization? Thank you again!
