Getting Started with R and Google Analytics


R is a statistical tool that can enrich your website data analysis. Unfortunately, R has a steep learning curve and can be intimidating to a first-time user. However, the payoff is worth the effort. In this post, I will help you get started by demonstrating some of the most useful R commands.

First, you need to import your Google Analytics data into R. In this post, I’ll be using the rga package to download the data directly into R from the Google Analytics API. If you haven’t done this before, take a look at these instructions. You can use this package to pull in unique pageviews to each of your blog posts.
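A minimal sketch of that pull, assuming you have already authorized the rga package and created an instance named ga per the linked instructions; the view ID, date range, and the /blog/ path filter are placeholders to adjust for your own site:

```r
# Assumes an authorized rga instance named "ga" (see the setup instructions);
# the view ID, dates, and the /blog/ path filter are placeholders.
library(rga)
rga.open(instance = "ga")

gadata <- ga$getData(
  ids = "ga:XXXXXXXX",               # your Google Analytics view (profile) ID
  start.date = "2016-01-01",
  end.date = "2016-03-31",
  metrics = "ga:uniquePageviews",
  dimensions = "ga:pagePath",
  filters = "ga:pagePath=~^/blog/",  # keep only blog posts; adjust to your URLs
  sort = "-ga:uniquePageviews"
)
```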

If you are using other data sources, or are having trouble with the API, you can also import data from a CSV file using the read.csv command.
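A one-line sketch, assuming your export is saved as gadata.csv in the working directory (the file name and its columns are placeholders):

```r
# Read a CSV export into a data frame; stringsAsFactors = FALSE keeps
# page paths as plain character strings rather than factors.
gadata <- read.csv("gadata.csv", stringsAsFactors = FALSE)
```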

In contrast to a program like Excel, in R you will not immediately see your data after it is downloaded. Although this may be disconcerting at first, this issue is easily addressed if you know what to look for.

If you are using RStudio (which I recommend), you can easily view the dimensions of your dataset in the Environment tab. Clicking the grid icon next to a dataset in the Environment tab will open the entire data set in a viewer above the console.


This will help you confirm that you pulled the correct data and that there are no glaring errors in the data. However, if you have a large data set, the process of printing out the data can take a long time. Here are a few other quick commands that will show you a snapshot of the data you downloaded:


The head command will print out the first 6 rows of your dataset.
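For example, assuming your data frame is named gadata (as in the import step):

```r
head(gadata)      # prints the first 6 rows
head(gadata, 10)  # or specify how many rows to show
```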



The str command will show you the structure of your dataset. The key pieces of information are the number of rows in your data and the labels of each of your columns. The first few values of each column are also listed.
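Again assuming a data frame named gadata:

```r
str(gadata)  # prints the object's class, the number of rows (obs.), and
             # each column's name, type, and first few values
```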



The unique command will print out a de-duplicated list of values.
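For instance, assuming a pagePath column in gadata (both names are placeholders from the import step):

```r
unique(gadata$pagePath)  # distinct values in the pagePath column
```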


Summary Statistics and Histograms

The Google Analytics interface is the best place to find traffic and engagement totals for the whole site, or broken out by different dimensions. However, it is difficult to get certain descriptive statistics, like averages or standard deviations. These statistics help you answer questions like, “How is the average blog post performing?” or “How much does performance vary between different posts?”

The summary command gives you a quick view of the range and average of the data set. Here you can see the minimum value, the maximum value, each of the quartiles, as well as the median and mean. This should give you a good feel for the average performance of a blog post, as well as a good idea of the range between your best and worst performing posts.
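Assuming the unique pageview data pulled earlier:

```r
summary(gadata$uniquePageviews)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
```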


You can also compute the standard deviation using the function sd.
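Using the same assumed column:

```r
sd(gadata$uniquePageviews)  # sample standard deviation of unique pageviews
```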


If you want to get a better feel for how different posts are performing, you can visualize your data in a histogram. You can use a popular package called ggplot2 to make the histogram.
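A sketch of the histogram, assuming the gadata data frame from earlier; the binwidth of 50 is an arbitrary starting point you should tune to your own traffic levels:

```r
# install.packages("ggplot2")  # run once if the package isn't installed
library(ggplot2)

ggplot(gadata, aes(x = uniquePageviews)) +
  geom_histogram(binwidth = 50) +  # adjust binwidth to your traffic levels
  labs(x = "Unique Pageviews", y = "Number of Posts")
```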


Trends over Time

The Google Analytics interface offers great tools for plotting trends by day, week, or month. Hourly timelines are also available in a few reports like the Audience Overview. However, you can use R to graph trends by hour or minute for any dimension. This can be particularly useful for investigating when a particular article was trending on Twitter, or when problems related to a particular browser arose. To create the plot, first pull the appropriate data by hour.
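A sketch of such a pull with rga, using sessions by hour and browser as an example; the view ID and dates are placeholders:

```r
# Assumes the same authorized "ga" instance as before; ga:dateHour returns
# strings like "2016030114" (YYYYMMDDHH). The ID and dates are placeholders.
hourly <- ga$getData(
  ids = "ga:XXXXXXXX",
  start.date = "2016-03-01",
  end.date = "2016-03-02",
  metrics = "ga:sessions",
  dimensions = "ga:dateHour,ga:browser"
)

# Convert the dateHour string to a proper date-time for plotting
hourly$time <- as.POSIXct(hourly$dateHour, format = "%Y%m%d%H")
```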

Then create the line graph using ggplot2.
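A sketch, assuming a data frame named hourly with a date-time column time, a sessions metric, and a browser dimension (all names are assumptions from the pull above):

```r
library(ggplot2)

# One line per browser, sessions by hour
ggplot(hourly, aes(x = time, y = sessions, color = browser)) +
  geom_line() +
  labs(x = "Hour", y = "Sessions", color = "Browser")
```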


Hypothesis Testing

R is a great tool for running statistical tests. For example, you can check if the conversion rate or bounce rate for a certain landing page is significantly better than another landing page. You can check for statistical significance using a difference of proportions test. First, you need to pull the number of conversions and the number of observations for each landing page.
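A sketch with placeholder numbers; in practice you would pull a conversion metric (such as ga:goalCompletionsAll) and ga:sessions for each landing page from the API:

```r
# Placeholder numbers: conversions and total sessions for two landing pages
conversions <- c(pageA = 120, pageB = 90)
sessions    <- c(pageA = 2000, pageB = 1900)
```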

Next, use the R function prop.test to check for significance.
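Continuing with the same placeholder numbers (x is the count of conversions, n the count of observations for each page):

```r
# Two-sample test for equality of proportions; the numbers are placeholders
prop.test(x = c(120, 90), n = c(2000, 1900))
```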


There are several important pieces of information here, but if you are new to hypothesis testing, the thing you should pay attention to is the p-value. This tells you how likely you would be to see a difference in performance that extreme if the underlying conversion rate of each page were actually the same. A lower p-value gives you more confidence that there is a real difference between the two pages (checking for p-values under 5% is typical). Quick tip: R will sometimes print out the p-value using scientific notation, so a value like 2.2e-16 is extremely small, not large. Don’t let this fool you.

These are a few tips to get you started with the many tools and packages that R makes available.

Becky is a Data Scientist at LunaMetrics. She started deriving equations and building calculators in high school and ended up with a Masters in math from Georgia Tech. Her experience in data analysis and reporting has given her a great appreciation for data-driven decision making. Becky enjoys swimming, working on puzzles, and spending time with her husband Jonathan.

  • Nico

    Big fan of your GA and GTM book and articles.

    Great article for where to start with Google Analytics in R.

    And if I can suggest, great packages to use for GA and WMT are also RGoogleAnalytics, googleAnalyticsR, googleAuthR and searchConsoleR, which for me make life even easier with R.

    ggplot2 is an awesome package and going deeper with it can create great line graphs (I use it to make reports with graphs identical to the ones you see on GA)

    • Cecio82

      Ya, ggvis package, right?

      I also suggest RGA (not “rga”) package for managing Google Management API, Core Reporting API, MCF API and so on. “forecast” package is well suited for handling time series, which is common task for a web analyst.

      Other useful analysis techniques for a web analyst, easily provided by R, are cross correlation functions which are useful to find correlation between time series at lag “N” (ie: online/offline sales comparison, avg time to purchase given a specific action [let’s say “add to basket”], etc…).

    • Becky West

      Thanks for your great suggestions Nico and Cecio82. I always love hearing what other people in the field are up to.

      Yes, the RGA vs rga is a bit confusing, but I agree that (CRAN) RGA has a lot of great features.

      Have either of you used the ChannelAttribution package on CRAN? I stumbled across it the other day and am curious if other people in the industry have found the Markov representation useful.

      • Cecio82

        I’m studying for a better comprehension of Markov Chain/Model theory, implications, limits and applications right now.

        I have business problems that I think this approach can solve (I have sequences of events and one or more target events. I’d like to predict the probability of reaching the target events, given a previous sequence of events).

        For instance let’s say that a purchase is the target event and that you are able to identify, for each customer (or cookie), a sequence of events. Let’s say: “First Entry”, “Search for an item”, “Product View”, “Add to Wishlist”, etc.

        It would be useful to build a model that, given a specific sequence of these events, can predict the likelihood of a purchase.

        I’m not a statistician, so I’m not sure if I’m approaching the problem in the right way. I’ve tried logistic regression, but using a different kind of dataset (with dummy variables for “has seen an item page”, “has searched for a product”, etc…) and with poor results in terms of predictive capability.

      • The ChannelAttribution package is nice as it smooths away a lot of the complications of implementing a Markov model via a nice input data format that is well suited to the multi-channel funnels export from GA (you will recognise the example data for its example app is from there). It is a little black-box for day-to-day use, though; it would need some justification to a client why you prefer using its model rather than last touch, for example.

        • Cecio82

          Hi Mark, I really appreciate your googleAnalyticsR 🙂

          I’m trying the ChannelAttribution package right now and I’m wondering if anyone is able to explain the main differences between the Markov Model and the Shapley Value approach (Data-Driven Attribution, which is a Google Analytics Premium feature) in terms of implications, limitations, etc…

          I’ve had very different results between these models for the same channels.

          I guess it’s related to non-converting paths, which are taken into account in the DDA model (but not in the ChannelAttribution package model), but that’s just a guess…

          Thank you all.

          • DanR

            Hi Cecio82,

            Essentially the Markov Model is based on Markov chains, which treat each event as independent of all other events. Therefore if you are observing a social channel then the likelihood of you moving to an organic, ppc or email channel will only be based on the fact that you are currently observing a social channel. The Shapley value, on the other hand, is based on game theory and will include information on all the channels simultaneously. Therefore you shouldn’t expect the same answer when using these two different methods. Although I’m sure you are right about the non-converting paths and the difference this will account for. The inclusion of non-converting paths was actually what got me interested in the Markov chain solution. However it is difficult to say which method is better. Since I don’t have access to GA Premium anymore I focus more on Shapley and regression modelling.

            Hope this helps!

          • Cecio82

            Hi DanR, really appreciate your reply!

            I’m wondering if the “customer journey” problem can be reduced to a binary classification one (customer journeys that convert vs. not) in order to use ensemble methods (gbm, rf, xgboost) or a linear model (logistic) for modelling purposes.

            Probably impossible to achieve without BigQuery (and a Premium solution).

            Here are some references:

          • DanR

            Hi again Cecio82,

            When I set up the Markov chain model I reduced it to a binary classification, choosing to only look at whether a customer converted or not. Then I used ROC curves to compare 2nd, 3rd and 4th order Markov chains according to how well they fit the data. However this method is only possible if you have access to the touchpoints of both converting and non-converting customers. As you mentioned this is probably only possible if you have access to GA Premium (please let me know if you manage to find a way around it!). Although there are plenty of examples of people using regression and boosted models for attribution, I have yet to implement such a solution. It would be interesting to see what you come up with.


    • Thanks for the mention of googleAnalyticsR Nico 🙂 Now is a good time to check it out as it has just published access to the new v4 API and BigQuery integration with GA 360.

      I’d also mention the other GA->R packages I know, to suit all needs: rga, RGA, RGoogleAnalytics, ganalytics, and GAR. It’s quite a vibrant area at the moment, with other connections to BigQuery, AdWords, Twitter, Facebook etc.

    • Is there a way to discern which GA/R package to use? Sounds like there are so many — wondering if there is a resource that speaks to the pros/cons of each.

      I’ve used the googleAnalyticsR package and it seems to do a great job, so probably sticking with that unless certain packages are best for certain data analysis scenarios.

