# Marketing Channel Attribution with Markov Models in R


Data empowers us to better understand our users and their behaviors, while methods provide us with the means for analysis. These methods, ranging anywhere from simplistic (e.g. frequency) to complex (e.g. clustering), allow us to choose what we want to understand from the data.

A popular way to understand our users and their behaviors in Google Analytics is through multichannel attribution in the Multi-Channel Funnels Reports using simple heuristics: First Click, Last Click, and Linear Attribution. Although these methods respectively provide insight into the frequency of the first marketing touchpoint, the frequency of the last marketing touchpoint, or a sequence of equally important marketing touchpoints, a data consumer may want a different snapshot. For instance, someone who wants to understand the level of importance and/or the value of each touchpoint in relation to conversions must use a different method: Markov modeling.

Just a few, quick reasons to use this method…

2. Even if you do have GA360, the Data-Driven Attribution algorithm is a bit of a black box.
3. You can use any sequenced data – you are not limited to only using marketing channels.

Let’s dive in!

## The Make of Markov

A Markov model determines the probability that a user will transition from Sequence A to Sequence B based on the steps that each user takes through a site. The contents of these sequences are determined by the Markov order, which ranges from 0 to 4. Use the following as an example:

• Order 0: Doesn’t know where the user came from or what step the user is on, only the probability of going to any page.
• Order 1: Looks back zero steps. You are currently at Step A (Sequence A). The probability of going anywhere is based on being at that step.
• Order 2: Looks back one step. You came from Step A (Sequence A) and are currently at Step B (Sequence B). The probability of going anywhere is based on where you were and where you are.
• Order 3: Looks back two steps. You came from Step A > B (Sequence A) and are currently at Step C (Sequence B). The probability of going anywhere is based on where you were and where you are.
• Order 4: Looks back three steps. You came from Step A > B > C (Sequence A) and are currently at Step D (Sequence B). The probability of going anywhere is based on where you were and where you are.

Markov models account for user paths that extend past the order number by acting like a sliding window. Let’s say that User X’s steps were as follows: A > B > C > D > E > F > G. This model would show User X going from Sequence A (A > B > C > D) to Sequence B (B > C > D > E) to Sequence C (C > D > E > F), and so on until User X either exited or converted.
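To make the sliding window concrete, here is a minimal base R sketch that splits User X's path into overlapping sequences of four steps. The function name `sliding_windows` and the window width are illustrative assumptions, not part of ChannelAttribution.

```r
# Split a path into overlapping windows of a fixed width.
# `sliding_windows` is a hypothetical helper written for this example.
sliding_windows <- function(steps, width) {
  if (length(steps) < width) {
    return(list(steps))  # path shorter than the window: keep it whole
  }
  starts <- seq_len(length(steps) - width + 1)
  lapply(starts, function(i) steps[i:(i + width - 1)])
}

path <- c("A", "B", "C", "D", "E", "F", "G")
windows <- sliding_windows(path, width = 4)

# Collapse each window into a readable "A > B > C > D" label
labels <- vapply(windows, paste, character(1), collapse = " > ")
print(labels)
```

Each label is one "sequence" in the sense used above, and consecutive labels overlap by all but one step.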

Choosing the best Markov order can be difficult. Without getting into a lot of detail, one way is to plot the training accuracy of the model versus the training standard deviation. The goal is to find where these two lines intersect, or where the model gains variability and loses accuracy equally.

Although it may appear daunting, having a basic understanding of the math behind the model can be helpful as well.

Luckily, this can be simplified into 3 main parts:

1. The Transition Probability ($w_{ij}$) = The Probability of the Current State (Sequence B, $X_t$) Given the Previous State (Sequence A, $X_{t-1}$): $w_{ij} = P(X_t = s_j \mid X_{t-1} = s_i)$
2. The Transition Probability ($w_{ij}$) is No Less Than 0 and No Greater Than 1: $0 \le w_{ij} \le 1$
3. The Sum of the Transition Probabilities Out of Any State Equals 1 (Everyone Must Go Somewhere): $\sum_{j} w_{ij} = 1$
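As a rough illustration of these three properties, the base R sketch below estimates first-order transition probabilities from a handful of made-up journeys. The journeys and the state labels `(start)`, `(conversion)`, and `(null)` are assumptions chosen for the example; ChannelAttribution handles this step internally.

```r
# Hypothetical journeys; the bracketed labels mark the entry state and
# the two exit states (converted / left without converting).
paths <- list(
  c("(start)", "alpha", "beta", "(conversion)"),
  c("(start)", "alpha", "(null)"),
  c("(start)", "beta", "alpha", "(conversion)"),
  c("(start)", "beta", "(null)")
)

# Collect every observed (from, to) pair across all journeys
from <- unlist(lapply(paths, function(p) head(p, -1)))
to   <- unlist(lapply(paths, function(p) tail(p, -1)))

# Count the transitions, then divide each row by its total so that
# row i holds the probabilities of moving from state i to each state j
counts <- table(from, to)
w <- prop.table(counts, margin = 1)
print(round(w, 2))

# Every row sums to 1: everyone must go somewhere
print(rowSums(w))
```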

ChannelAttribution, an R library, builds the Markov models that allow us to calculate the number of conversions and/or conversion value that can be attributed to each marketing channel. In other words, ChannelAttribution uses Markov models to determine each channel’s contribution to conversion and/or value.

This model focuses on solving the following issues:

1. Objectivity – No gut feelings here! Only facts.
2. Predictive Accuracy – Predicts conversion events.
3. Robustness – Valid and reliable results.
4. Interpretability – Transparent and relatively easy to interpret.
5. Versatility – Not dataset dependent. Able to adapt to new data.
6. Algorithmic Efficiency – Provides timely results.

It’s also important to keep in mind the following limitations of attribution:

1. Endogenic – Attribution is relative to underlying conditions.
2. Not Strict Causal Interpretation – Markov models do not explain 100% of the variance between marketing channel contributions. For instance, certain marketing channels may be inherently more effective in a given setting.

This library estimates the channel attribution by calculating the Removal Effect ($s_i$). Essentially, the Removal Effect measures how much the probability of converting changes when a step is completely removed; all sequences that had to go through that step are now sent directly to the exit node. This calculation is done by running a large number of simulations on the Markov model with the removed step. By default, it runs 1 million simulations. This occurs for each step present in the data.
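The idea can be sketched on a tiny, hand-made first-order chain. Note the assumptions: the transition probabilities are invented, the conversion probability is solved exactly with linear algebra rather than by simulation as the library does, and the removal effect is expressed as the relative drop in conversion probability, which is one common formulation.

```r
states <- c("(start)", "alpha", "beta", "(conversion)", "(null)")
P <- matrix(0, 5, 5, dimnames = list(states, states))

# Invented transition probabilities (each row sums to 1)
P["(start)", c("alpha", "beta")] <- c(0.6, 0.4)
P["alpha", c("beta", "(conversion)", "(null)")] <- c(0.5, 0.2, 0.3)
P["beta", c("(conversion)", "(null)")] <- c(0.5, 0.5)
P["(conversion)", "(conversion)"] <- 1  # absorbing state
P["(null)", "(null)"] <- 1              # absorbing state

# Probability of eventually converting when starting at "(start)",
# found by solving (I - Q) x = r over the non-absorbing states
conversion_prob <- function(P) {
  transient <- c("(start)", "alpha", "beta")
  Q <- P[transient, transient]
  r <- P[transient, "(conversion)"]
  reach <- solve(diag(length(transient)) - Q, r)
  unname(reach[1])
}

# Removing a channel: everything that used to flow into it now exits
remove_channel <- function(P, channel) {
  P[, "(null)"] <- P[, "(null)"] + P[, channel]
  P[, channel] <- 0
  P
}

channels <- c("alpha", "beta")
p_full <- conversion_prob(P)
removal_effect <- vapply(
  channels,
  function(ch) 1 - conversion_prob(remove_channel(P, ch)) / p_full,
  numeric(1)
)

# Normalise the removal effects into attribution shares
shares <- removal_effect / sum(removal_effect)
print(round(rbind(removal_effect, shares), 3))
```

Multiplying `shares` by the total number of conversions (or the total conversion value) would give each channel's attributed amount under these assumptions.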

If you aren’t too familiar with R, but you’d still like to take advantage of what ChannelAttribution has to offer, there’s still hope! Or, if you would rather see the code, click here.

The link should bring you to the following:

As shown in the image, click the “Load Demo Data” button (when you’re ready, you can load your own data by clicking the “Choose File” option under “Load Input File”).

If you’re using the demo data, the options are preselected for your convenience. Otherwise, you will need to choose the delimiter that separates the values in your data. Then you fill in the column names for your variables:

• Path Variable – The steps a user takes across sessions to comprise the sequences.
• Conversion Variable – How many times a user converted.
• Value Variable – The monetary value of each marketing channel.
• Null Variable – How many times a user exited.

Then you hit “Run”! After it’s done executing (check the top right corner for progress), click on the “Output” tab and analyze your results. In a lot of cases, you can see where this model is helpful: compared with the Markov model, the simple heuristics give some channels too little attribution and others too much. Click here to jump to the analysis section, complete with the bar charts and table from the output.

If you’re more familiar with R, you might like this option better as you can customize your model and graphs. To get started, follow these steps:
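A minimal setup sketch, assuming the three packages named in this article (ChannelAttribution, reshape, and ggplot2) and the demo dataset that ships with ChannelAttribution. The install line is commented out so the snippet has no side effects if the packages are already present.

```r
# Install once per machine (uncomment if the packages are missing)
# install.packages(c("ChannelAttribution", "reshape", "ggplot2"))

library(ChannelAttribution)  # heuristic and Markov attribution models
library(reshape)             # melt() for reshaping the results
library(ggplot2)             # bar charts

# Load the demo data bundled with ChannelAttribution.
# This creates a data frame called `Data` in your workspace.
data(PathData)
str(Data)
```

To use your own data instead, read it into a data frame with the same four kinds of columns (path, conversions, value, null).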

Next, remind yourself of the variables used in calculating the models:

• Path Variable – The steps a user takes across sessions to comprise the sequences.
• Conversion Variable – How many times a user converted.
• Value Variable – The monetary value of each marketing channel.
• Null Variable – How many times a user exited.

Build the simple heuristic models (First Click / first_touch, Last Click / last_touch, and Linear Attribution / linear_touch):
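A sketch of this step using the package's `heuristic_models()` function on the bundled demo data. The column names passed in are those of the demo `Data` frame; with your own data they would be whatever your columns are called.

```r
library(ChannelAttribution)

# Demo data shipped with the package; creates a data frame named `Data`
data(PathData)

# First-touch, last-touch, and linear attribution in a single call
H <- heuristic_models(
  Data,
  var_path  = "path",                   # channel sequence, e.g. "alpha > beta"
  var_conv  = "total_conversions",      # conversions per path
  var_value = "total_conversion_value"  # monetary value per path
)
head(H)
```

The result has one row per channel. Output column names have varied between package releases, so it is worth checking `names(H)` before relying on them.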

Build the Markov model (markov_model):
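A sketch of this step with `markov_model()` on the demo data, assuming a first-order model. Passing the null column via `var_null` is a choice made here so that non-converting paths are taken into account.

```r
library(ChannelAttribution)

# Demo data shipped with the package; creates a data frame named `Data`
data(PathData)

M <- markov_model(
  Data,
  var_path  = "path",
  var_conv  = "total_conversions",
  var_value = "total_conversion_value",
  var_null  = "total_null",  # paths that ended without a conversion
  order     = 1              # Markov order; try higher values to compare
)
head(M)
```

Because the estimate relies on simulation, repeated runs can give slightly different numbers. The function documents a `seed` argument for reproducibility; see `?markov_model` for the options available in your installed version.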

Perform some quick data munging for total conversions:
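One way this could look, rebuilding both models first so the snippet runs by itself. The heuristic column names below are an assumption about the package's output, and the Markov column is matched by pattern because it has been named both `total_conversion` and `total_conversions` across releases; check `names(R)` if the selection fails.

```r
library(ChannelAttribution)
library(reshape)

data(PathData)
H <- heuristic_models(Data, "path", "total_conversions",
                      var_value = "total_conversion_value")
M <- markov_model(Data, "path", "total_conversions",
                  var_value = "total_conversion_value",
                  var_null = "total_null", order = 1)

# One row per channel, heuristic and Markov results side by side
R <- merge(H, M, by = "channel_name")

# Keep only the conversion counts
markov_col <- grep("^total_conversions?$", names(R), value = TRUE)
R1 <- R[, c("channel_name", "first_touch_conversions",
            "last_touch_conversions", "linear_touch_conversions",
            markov_col)]

# Shorter names that double as legend labels in the chart
colnames(R1) <- c("channel_name", "first_touch", "last_touch",
                  "linear_touch", "markov_model")

# Wide to long: one row per (channel, method) pair
R1 <- melt(R1, id.vars = "channel_name")
head(R1)
```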

And now the fun part… Plotting the total conversions:
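A sketch of a grouped bar chart with ggplot2. To keep the snippet runnable without the models, it uses a small stand-in frame in the long format that `melt()` produces (`channel_name`, `variable`, `value`); the numbers are placeholders, not results. The title, axis labels, and theme are styling assumptions.

```r
library(ggplot2)

# Stand-in for the melted results; swap in your own long-format frame
plot_data <- expand.grid(
  channel_name = c("alpha", "beta", "theta"),
  variable = c("first_touch", "last_touch", "linear_touch", "markov_model"),
  stringsAsFactors = FALSE
)
plot_data$value <- seq(10, by = 5, length.out = nrow(plot_data))  # placeholders

# One group of bars per channel, one bar per attribution method
p <- ggplot(plot_data, aes(x = channel_name, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Conversions", x = "Channel",
       y = "Conversions", fill = "Method") +
  theme_minimal()
print(p)
```

Replacing `plot_data` with the melted conversion table from the munging step gives the chart discussed in the analysis section.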

Thankfully, the process of creating the Total Value bar chart is very similar to creating the Total Conversions bar chart:
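For completeness, here is the whole pipeline for the value chart in one hedged sketch: the only real difference from the conversions version is which columns are kept. As before, the `*_value` column names are assumptions about the package's output, so check `names(R)` against your installed version.

```r
library(ChannelAttribution)
library(reshape)
library(ggplot2)

data(PathData)
H <- heuristic_models(Data, "path", "total_conversions",
                      var_value = "total_conversion_value")
M <- markov_model(Data, "path", "total_conversions",
                  var_value = "total_conversion_value",
                  var_null = "total_null", order = 1)
R <- merge(H, M, by = "channel_name")

# Keep the value columns this time instead of the conversion counts
R2 <- R[, c("channel_name", "first_touch_value", "last_touch_value",
            "linear_touch_value", "total_conversion_value")]
colnames(R2) <- c("channel_name", "first_touch", "last_touch",
                  "linear_touch", "markov_model")
R2 <- melt(R2, id.vars = "channel_name")

p_value <- ggplot(R2, aes(x = channel_name, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Conversion Value", x = "Channel",
       y = "Value", fill = "Method") +
  theme_minimal()
print(p_value)
```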

## The Long-Awaited Bar Charts, Table, and Brief Analysis

### Total Conversions

The “Total Conversions” bar chart shows you how many conversions were attributed to each channel (e.g. alpha, beta, etc.) for each method (e.g. first_touch, last_touch, etc.). Analyzing the graph, specifically the purple bar (markov_model) in comparison to the other methods, you can gain insights such as the following:

• “alpha” was not actually as important in assisting conversions as the simple heuristics found.
• “epsilon”, “lambda”, “theta”, and “zeta” were more important in assisting conversions than the simple heuristics found.

### Total Conversion Value

The “Total Conversion Value” bar chart shows you the monetary value that can be attributed to each channel from a conversion.

For instance, you can see the following:

• “alpha” was not actually as valuable in assisting conversions as the simple heuristics found.
• “epsilon”, “lambda”, “theta”, and “zeta” were more valuable in assisting conversions than the simple heuristics found.

### Table Form – Available via ChannelAttributionApp (GUI)

Furthermore, the GUI puts all this data into a table that you can download and open in Excel if you want to create your own charts.

### Open in Excel

Although you can download the table and open it in Excel, it comes as semicolon-separated values. To convert this into usable data, you can follow these steps:

Select Column A. Then on the Data tab, click “Text to Columns”. It will bring up the following prompt:

Select “Delimited” and click “Next”.

Choose the “Delimiter”. In this case, it is the semicolon. Click “Finish”. From here, you can make your bar charts again or create custom charts to your liking.

## Opportunities Abound!

Now that you can get a better understanding of your Google Analytics marketing channel data, there is room to explore additional features of ChannelAttribution, reshape, and ggplot2. Bear in mind that although this library is mainly used for channel attribution issues, you can use it for almost any sequenced data. So get creative, and maximize your data’s potential!

Kaelin Harmon is a Junior Analyst at LunaMetrics. She is pursuing her bachelor's and master's degrees in Data Analytics at Robert Morris University. She's all about bridging the gap between businesses and customers by turning data into actionable insights. If you see someone in a Storm Trooper onesie enjoying delicious grub at a restaurant or exploring downtown Pittsburgh, it's probably her. She's loving her new freedom from the First Order.

• Doug Male

Fantastic work but….keeps disconnecting me from App when i try to run my own data?!

• Cecio82

The original paper suggests that a higher-order Markov model (k = 3 or 4) should provide better predictive accuracy, though there is a tradeoff between accuracy and robustness.

• Jesse

How did you pull the null variable from GA?

• Michal Prochazka

You don’t need them. Just download sessions and transactions from GA. Order them by user and time, create sequences, and mark every user’s session sequence that did not finish with a transaction (conversion).

• Udi Sabach

• Michal Prochazka

You can use https://ga-dev-tools.appspot.com/query-explorer/ to extract data or some R library and you will see that not all conversion paths are finished by transaction. I can send you some example.

• Udi Sabach

i see, maybe i am missing something… but by definition, a conversion path would need to end in conversion. otherwise, it’s not a conversion path.

• Michal Prochazka

The conversion path “alfa>beta>gama” can end in a conversion for 3 users, and the same path can end without a conversion for another 5 users.
Sorry for using “conversion path”; let’s switch to “channel sequence”.

• Al3x4nd3rR

Hi Michal,

I still do not get how you are able to pull out MCF data from GA and look at non-conversion paths/channel sequences. Can you please share what dimensions and metrics you pull out to get this data?

• Michal Prochazka

Hi Al3x4nd3rR,
just take browserId (custom dimension), date, hour, minute, channel, source / medium, campaign and sessions + transactions.
Sort the data by browserId, date, hour, minute and you have all conversion and NON-conversion journeys.
Of course you can filter on those visitors who start shopping, e.g. with “add to basket”.

Kind Regards MP

• Josef

Ignoring how GA is messing up visits from direct channels, such an export will be too vague for any serious attribution analysis. No objection, you can get the data; it’s just that any calculation on top of such data is not of much use. Btw, you can store the time in a custom dimension as well and use three dimensions for something more useful.

• Michal Prochazka

Hi Josef,
I was elaborating on how to get the data. Personally, I think everyone should be curious about data quality and collection methods first, before starting any experiments.

• F

Here some code on getting the right data frame in R.

However, the question that remains is how to get the null column? I was wondering whether you Michal could add some code on how to get the null values?

```r
install.packages("RGA")
library("RGA")
authorize()

# Use my_accounts to find the viewId. Make sure to replace this with your viewId.
my_id <- XXXXXX

# Get a data frame with all metrics except NULL
ga_data <- get_mcf(my_id, start.date = "30daysAgo", end.date = "today",
                   metrics = c("mcf:totalConversions", "mcf:totalConversionValue"),
                   dimensions = "mcf:sourceMediumPath")
```

• Andy Donaldson

great…I want to use the same approach of Markov MC models for page paths but the data sets are awful should I query ga:goalCompletionLocation/ga:goalPreviousStep1 /ga:goalPreviousStep2/ga:goalPreviousStep3 ? then look at uniquepage views and conversions?
thoughts?

• Jake Weinberger

Can you please elaborate on this? I don’t understand how you can have a conversion path that doesn’t end in a conversion.

• Michal Prochazka

Sorry for the misunderstanding, I will try to make it clear. A conversion path in this case is a sequence of channels which brings at least one user to your site. These sequences grow with each new user visit until the user makes a conversion. Each sequence has three fact values: the number of users who came by this sequence and made an order, the sum of order revenue, and the number of users without an order who followed the same path. A user who made a conversion last week will come to your site next week. He will start a new path or “customer journey” with that “next week” visit, and the “null variable” will be increased for another sequence.

• Samit Gorai

I did not get it properly. When you are talking about channel attribution you mean mediums like SEO, PPC, Campaign, Email, Direct etc. Now how come a visitor have so many channel touch points and what about recency factor? This model can be very much applicable for content scoring or assigning \$value to various pages but I’m not very confident about channel attribution.

• Michal Prochazka

Hey, my understanding is that alpha, beta, … are channels as you mentioned. Each sequence contains all user sessions (interactions) which were measured. In terms of recency, you can check whether the “ChannelAttribution” or “MC” implementation calculates recency based on channel order in the sequence, but the input data didn’t carry any information about the time between the last interaction and the conversion.

• Udi Sabach

Hello. is it possible to generate the table in R?

• Jon

Well found Kaelin. Have you managed to find a package that replicates the Game theory solution employed in GA’s DDA model? In many ways the math is simpler in their approach.

• Анатолий Андрющенко

Can you explain where in the Analytics reports I can download the Null Variable?

• Abhishek Sinha

really helpful. Could you please explain how we come up with total_conversion_value column data?

• Brian Maher

Excellent piece, as always. What are the chances of demo extract excel or example of export data from GA? Would clarify a few points for some people here I think

• Dina Dicic

Yes that would really help! I struggle with the export from GA.

• Martin Frotzler

You can either export it from within GA and import the CSV file into R or go the direct way (which is more elegant). The direct way goes via RGA library and access token, you can look it up. You then access the Multi Channel Funnel Data in R via get_mcf.

Using the channels can be a bit of a hassle as you come across a few formatting problems. The ChannelAttribution library doesn’t like empty spaces and such. But once you’ve dealt with that in your data, it’s quite convenient to use. The good thing is that the general formatting A>B>C comes from GA, so this doesn’t need to be addressed separately.

Have fun!

• Dina Dicic

Thank you!

• Julio

Wow, thanks a lot for this. I have been searching for days and this is the first article with hard information.

• Pierre

Hello, thanks for sharing this awesome post and code !

I have a small question regarding attribution with markov model of order > 1. What happens then ? The markov chain becomes bigger so that you also have nodes like (alpha, eta) or (beta, alpha). How can you then calculate the contribution of each channel ? It’s obvious for each node of the graph but not for each channel (to answer the question, what should be my spending repartition).

I guess one obvious way would be to delete all nodes containing the channel. But I don’t see any theoretical guarantee. Something else could be to use some game theory? Something like this maybe: http://eprints.soton.ac.uk/380534/1/GHLEFMG_FGMJHM_VJ1QM9QF.pdf ?

Again, great work and thanks for sharing.

• John Smith

Great post thanks. What is not clear, once computing the model how to use it to compute attributions of individual journeys, e.g. suppose you had “alpha-alpha-beta-delta-delta-(conversion)”, what would be the attributions for alpha, beta and delta for this journey using the model?

• Ray

Great article!

One question. Why are there large variances in attribution within the Markov model? In the simplified version, calculating the removal effect seems to be precise with no variance. How does this change when the model becomes more complex with more attribution channels? What is producing the varying results?

For instance, running the test data twice within shiny will produce different results for each attribution channel within the markov model, with variances ranging from -5% to 15% depending on the channel.

• Pablo

I have managed to get the charts. However, the channel names are showing partially e.g.: Dir, Ref and Org showing as ir,ef and rg
I would appreciate any help on this
Thanks

• Abhishek Bhartiya

You will have to remove all spaces between the words. for eg if channel name is social media posts, change it to either social_media_posts or SocialMediaPosts.

• Jonas

Thanks a bunch!

From which section in GA should I export? Assisted conversion/Top Conversion Paths/…?

It should be Multi-Channel Funnel Report

• CamilaPaz Pineda Parra

I’m going to use the Markov Chain package for make a channel attribution model, I’m going to use the routes that do the customers with the different channels through their customer’s journey, but I have a doubt, How can I be sure that in the routes that I am using belong to a period that satisfy the Markov condition? Does the package “markovchain” select this period? Or should I choose a period in which there is a recurrence?

Thanks

• Eric Brown

Thanks Kaelin!

I’ve used your suggestions on my GA multichannel data. Are there any ways the ChannelAttribution package can be used to suggest an optimal custom attribution model?

• Abhishek Bhartiya

Hi Kaelin!

Really helpful and so far the most detailed and useful article on statistical attribution modelling. I am trying to use the shinyapp link that you have provided and it worked fine a couple of times but after that it has been giving a server load error all the time. Is there any other link which you can provide for this? It would be a great help.

Thanks,
AB.

• Great information about marketing channel attribution with Markov models. Thanks for sharing

• ezequiel boehler

Hi! First of all thanks for this great information!

I have been running the model a few times already, with different accounts, just for testing of results, and I came up with some doubts.

Which dimension would be the best fit for running a markov chain model?
Though source/medium sounds like the optimal, it leaves out information as brand or unbrand ppc, remarketing vs prospecting and so. While using campaign, leaves out direct and organic traffic.

Should I disable autotagging and start tagging those type of parameters at a source/medium level?
My first answer would be no, but otherwise I dont see how doing attribution at source/medium level as it is now gives actionable insights.

Example: It turns out facebook/paid gets more conversions when running the markov model. Ok great. But I have 15 different campaigns on Facebook Ads, so how do I know in which should I allocate more budget?

Any ideas? Thanks!!

• Robbie Nohra

Question: why does the order even matter, given that that is outside the realm of influence? If I click a Facebook ad, and then later click a Google ad, there is no way for a marketer to influence that order: i.e. whether or not I visit Facebook and then Google or vice versa, all you can control is whether or not an ad exists in each medium, no? If we notice that people tend to visit two particular channels consecutively, how can that information actually be used?

All you can do is ensure that ads continue to exist in both mediums, but that would be the case regardless of the order. E.g. if A -> B’s converted but B -> A’s did not, you would still want to make sure ads existed in both A and B to keep the A -> B conversions. Even if A-> B’s and B->A’s both did not convert you would still need to look at both A and B in isolation to determine whether or not to keep ads in both mediums. Like I don’t see how the order will ever matter from a practical point of view.

• Dante

Great article!
You mentioned that the Markov model in the graph signifies assisted conversions. Could you please shed some more light on it?
Thanks a lot.
