Tracking PDFs and Other Downloads Inside Google Analytics… Server-Side!

Google Analytics is great for tracking just about anything – inside a webpage. Google’s JavaScript code sits nicely on your website’s HTML pages, and tracks all of your site’s pageviews, visitor session information and various user interactions and events (including resource downloads from links, with some minor tweaking).

The thing about JavaScript, though, is that it runs in the visitor’s browser (client-side), usually loaded from inside the page’s <head> tag. GA can only track a download when it starts from a page that runs that JavaScript.

This is all fine and dandy – except that the world doesn’t always work that way. People sometimes hotlink to PDFs, Word docs and images and visit them directly. And thank goodness! Can you imagine a world without direct links to imgur.com memes?

[Image: imgur meme]

In situations where visitors access a non-HTML resource directly, Google Analytics is not the tool for the job. An analyst would have to view raw server logs to determine how many times a PDF was requested.  For example, we determined that nearly half of a particular client’s PDFs were downloaded directly from an email blast campaign. Since there was no visit to the website involved, Google Analytics was clueless. Yet important data was sitting idle inside server logs, buried and inaccessible.

Enter server-side Google Analytics.  Note: This is a PHP-only solution.  Conceptually this can work in other environments, but PHP/Apache is my flavor of choice.

Thomas Bachem and others at United Prototype built a library for server-side Google Analytics called php-ga.  Their code allows your PHP server to hit Google Analytics by simulating a JavaScript request.  You’ll see what that means in a sec.

By integrating with this library, we can 1) set an .htaccess rule to reroute all PDF downloads through 2) a custom download.php script, which hits the library and 3) fires off a Google Analytics call.  You can keep your same folder structure and you won’t need to move any of your existing PDFs!  And no additional cookies required!  Let’s dig in.

 

Tools You Need:

  1. Apache server with PHP 5.3 or greater
  2. Notepad++, TextMate or a similar quality text editor
  3. FTP Client like FileZilla
  4. Google Analytics account

A Cautionary Word:

I strongly recommend using a new Google Analytics property for this.  Why?  Because the GA session data doesn’t fully match up between the client side (JavaScript) and the server side (PHP).  Please play it safe and use a separate property.  Otherwise you’ll have a whole lot of New Visitors and you’ll wonder why!  Don’t say I didn’t warn you! :-)

 

Step 1.

Download the php-ga library.  Look inside the folder labeled src and move autoload.php and the GoogleAnalytics folder to your website’s root directory.

Step 2.

Create a new PHP file called download.php.  This is where the magic happens.  The script 1) loads up the php-ga library, 2) creates a new visitor hit to GA, 3) tracks a virtual pageview for the PDF, 4) uses cURL to set a custom user-agent called LunaMetrics123 (you’ll see why later on), and 5) fetches the PDF and sends it to the browser.

Try it out!  Include the following code (be sure to add your GA-ID and domain below):

<?php
/* Using Server-Side Google Analytics to Track PDF downloads
http://www.lunametrics.com/blog/2013/06/04/tracking-pdfs-google-analytics-server-side/
Alex Moore  @alMoo
*/

// Set header MIME-Type for PDF
header("Content-type: application/pdf");

// Google Analytics Server Side
$GA_ACCOUNT = "UA-XXXXXXX-X"; // replace with your GA-ID
include "autoload.php";
use UnitedPrototype\GoogleAnalytics;

// Initialize GA Tracker
$tracker = new GoogleAnalytics\Tracker($GA_ACCOUNT, 'YOURDOMAIN.COM');

// Assemble Visitor information
$visitor = new GoogleAnalytics\Visitor();
$visitor->setIpAddress($_SERVER['REMOTE_ADDR']);
$visitor->setUserAgent($_SERVER['HTTP_USER_AGENT']);
if (!empty($_COOKIE['__utma'])) { $visitor->fromUtma($_COOKIE['__utma']); } // cookie is absent on direct hits
//$visitor->setScreenResolution('1480x1200');

// Assemble Session information
$session = new GoogleAnalytics\Session();
if (!empty($_COOKIE['__utmb'])) { $session->fromUtmb($_COOKIE['__utmb']); } // fromUtmb() throws on an empty value

// Get filename from the previous request
$filename = parse_url(urldecode($_SERVER['REQUEST_URI']), PHP_URL_PATH);
//$filetype = preg_replace("/.+\.(.+)/i","$1",$filename);

// Assemble Page information
$page = new GoogleAnalytics\Page($filename);
$page->setTitle($filename);
if (!empty($_SERVER['HTTP_REFERER'])) { $page->setReferrer($_SERVER['HTTP_REFERER']); } // not every request sends a referrer

// Track page view
$tracker->trackPageview($page, $session, $visitor);

// Create the URL for the PDF
$protocol = ((!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] != 'off') || $_SERVER['SERVER_PORT'] == 443) ? "https://" : "http://";
$url  = $protocol.$_SERVER['HTTP_HOST'].$filename;

// Fetch the PDF (cURL it)
$ch = curl_init($url);
// This creates a user-agent string that we set .htaccess to ignore (preventing an endless loop)
curl_setopt($ch, CURLOPT_USERAGENT, "LunaMetrics123");
$data = curl_exec($ch);
curl_close($ch);

// For good measure
exit;
?>

 

Step 3.

In the root of your website, open the .htaccess file.  This is a special system file that may be hidden on some machines.  You may need to create a new blank .htaccess file.  Don’t forget the leading dot!

(If working locally on Mac OS X, you may not see hidden system files by default.  This link can fix that.)

.htaccess provides instructions for the Apache server when a visitor tries to access files.  Our job then is to intercept the request for a PDF and reroute it to download.php.  Remember the LunaMetrics123 custom user-agent string earlier?  That’s a handy hack to make sure that the server doesn’t enter an endless loop… without setting a cookie!

Inside .htaccess, include the following lines:

# Use PHP 5.3 by default (HostGator or other shared hosts might require this line)
AddType application/x-httpd-php53 .php

# Turn on RewriteEngine (it might already be on for you)
RewriteEngine On

# If the user-agent string is NOT set by our PHP script...
RewriteCond %{HTTP_USER_AGENT} !LunaMetrics123 [NC]

# ... rewrite all requests for PDFs to our PHP script
RewriteRule ^.+\.pdf$ /download\.php [L,NC]

 

That’s it!

Once your code is in place, go to your website and try to download a PDF… directly.  Don’t visit the website first.  If you don’t have a PDF handy, here’s one.

Then… drumroll.  Open up your Google Analytics and view the Real-Time reports.

[Screenshot: Real-Time PDF Server-Side]

 

Take a look! You’ll see that you should have an active page (the PDF):

[Screenshot: Top Active Pages showing the PDF]

This is just a start.  There are so many things we can do with server-side Google Analytics.  Why stop at PDF downloads?  We could track direct image downloads, Word documents, 404 error pages, etc.  Share some other uses in the comments below!
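
As a starting point for other file types, here is a minimal sketch of how download.php might pick the Content-type header from the file extension instead of hard-coding application/pdf.  The function name content_type_for and the extension map are my own illustration (they are not part of php-ga), and the RewriteRule in .htaccess would also need to match the extra extensions.

```php
<?php
// Hypothetical helper (not part of php-ga): choose a Content-type header
// value from the requested file's extension. The map is only an example,
// not a complete list of MIME types.
function content_type_for($filename)
{
    $types = array(
        'pdf'  => 'application/pdf',
        'doc'  => 'application/msword',
        'docx' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
        'zip'  => 'application/zip',
        'png'  => 'image/png',
        'jpg'  => 'image/jpeg',
        'jpeg' => 'image/jpeg',
        'gif'  => 'image/gif',
    );

    // pathinfo() returns an empty string when there is no extension
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    // Fall back to a generic binary type for anything not in the map
    return isset($types[$extension]) ? $types[$extension] : 'application/octet-stream';
}
```

In download.php, the header() call would then have to move below the line that computes $filename, for example header("Content-type: " . content_type_for($filename));.  This only covers the header; the fetching part of the script is unchanged.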

Alex Moore

About Alex Moore

Alex Moore is our Analytics Department Lead. For over a decade he has built websites from the ground up, placing web analytics at the root of digital marketing strategy. Alex has spent eight years in the consultancy world and in client-facing roles, working on search engine marketing, development, web analytics and custom reporting. He also heads LunaMetrics training seminars across the country. The next seminar Alex will be leading will be a Google Analytics training in Houston. An avid independent backpacker and traveler, Alex also has his Masters in Religious Dialogue from Ireland, where he discovered that most sectarian conflict can be settled over a pint of plain.

http://www.lunametrics.com/blog/2013/06/04/tracking-pdfs-google-analytics-server-side/

34 Responses to “Tracking PDFs and Other Downloads Inside Google Analytics… Server-Side!”

Markus says:

Hello Alex,

very cool, because your solution also tracks direct entries to PDF docs. JavaScript cannot capture traffic on PDFs via Google results for example.

Alex Moore says:

Thanks, Markus! Clicking on a PDF directly from a link on the SERP is a perfect use-case for this script. Those visits can now finally be recorded!

Deri Haus says:

That’s really really cool. I’ll try it and I’m sure I’ll implement that in my projects.

Ben Meck says:

Alex – Love the methodology and the approach. We’ve experimented with this as well and have run into a few issues that we’ve custom-tweaked to allow campaign tracking to flow through as well.

Issue #1: Campaign tracking and accurately reporting referrals. We can set the utmz cookie with parameters included in the PDF, (example.pdf?utm_source=test&utm_campaign=test&utm_medium=test) but are needing to do some custom work around tracking the actual referral for PDFs appearing in organic search listings. Any ideas there?

Issue #2: Right now we’re dumping that PDF traffic into a separate suite, just like you. Otherwise you’ll have inflated sessions and pageviews on your actual site. Long term – it’d be great to figure out a way to get these included with other site metrics & engagement.

Not sure if you’ve encountered those issues yet – but if you ever dig deeper into those, I could definitely see a Part Two on the horizon…

Thanks for the post,
>> Ben

Alex Moore says:

Ben, thanks for sharing. These are not easy questions!

Regarding Issue #1, I’m not sure what your own implementation looks like, but in php-ga you actually DO have the option to set campaign parameters. There are various methods, including setMedium($medium) and setType($type) which mimic some of the behavior of the utmz cookie. You can find more information in Campaign.php inside the php-ga source: https://code.google.com/p/php-ga/source/browse/trunk/src/GoogleAnalytics/Campaign.php

In theory, you could rewrite .htaccess to pass along any utm campaign parameters and then use download.php to check the query string for the parameters and pass them to the above methods.
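
A rough sketch of the query-string half of that idea, assuming the utm parameters arrive on the PDF URL exactly as in your example.  The helper name extract_utm_params is hypothetical and not part of php-ga.  As far as I know, mod_rewrite keeps the original query string when the substitution has none of its own, so the rule above may not need to change at all, but please verify that on your own server.

```php
<?php
// Hypothetical helper (not part of php-ga): collect the utm_* campaign
// parameters from a query string such as
// "utm_source=test&utm_campaign=test&utm_medium=test".
function extract_utm_params($queryString)
{
    $known = array('utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content');

    // parse_str() decodes the query string into an associative array
    $all = array();
    parse_str($queryString, $all);

    // Keep only the non-empty campaign parameters, ignoring everything else
    $found = array();
    foreach ($known as $key) {
        if (isset($all[$key]) && is_string($all[$key]) && $all[$key] !== '') {
            $found[$key] = $all[$key];
        }
    }
    return $found;
}

// Inside download.php this might be called as:
//   $utm = extract_utm_params(isset($_SERVER['QUERY_STRING']) ? $_SERVER['QUERY_STRING'] : '');
// The values could then be handed to the setters in Campaign.php, such as
// setMedium($utm['utm_medium']). Check that file for the exact method names.
```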

Issue #2 is somewhat related I suppose. I’ve gone back into download.php above and added line 36:

$page->setReferrer($_SERVER['HTTP_REFERER']);

This sends GA referrer information for the PDF you are visiting, based on the visitor’s referrer data. That might help connect the dots, if a person clicks on a PDF from a link somewhere, on your website or elsewhere (including a search engine).

Hope that helps! There will definitely have to be a Part Two!

Hi Alex,

It’s very nice to see a post about server-side GA integrations that aren’t dependent on Measurement Protocol. Thank you!

Do you think that you could perhaps expand on the “cautionary word” you have above? Does PHP-GA inflate ‘visitor’ counts or just spawn new sessions? Since cookies can be read by PHP server side, the utma cookie should be enough to maintain visitor integrity, no?

As in the post:
$visitor->fromUtma($_COOKIE['__utma']);

In any case, whether it is sessions being inflated or visitors, I’m missing the utility of server-side tracking when co-deployed with client-side tracking if user behavior isn’t tied to the session. In a purely server-side GA set up, it sounds to me that the data would be fine. But if the user action (PDF download) can’t be tied to the session, what is the value?

One possible answer from above -> search traffic that directly accesses PDFs. Other possible answers?

Something that you may want to experiment with is to send the data to GA’s mobile processing pipeline. Changing the client side collection to use a MO-XXXXXX-Y web property ID and send the server-side hit to the same MO web property ID should maintain session integrity. Adding utmip as a hit parameter within php-ga should allow for accurate geo-location reporting as well. Unfortunately, it can be somewhat of a black box how utm.gif requests are actually processed on the GA servers and I’ve occasionally seen some wonky things data-wise, so I’m interested to know if the above method works for you as well.

Yehoshua

Alex Moore says:

Hello Yehoshua,

Thanks so much for your comment. The “cautionary word” is sort of a cop-out, to be honest. At the time of this posting, I hadn’t familiarized myself with php-ga nearly enough to know if $visitor->fromUtma($_COOKIE['__utma']) would work as expected, and I didn’t want to corrupt anyone’s main GA profile if the utma cookie wasn’t actually being passed into GA properly! For now, the cautionary word will remain a little ambiguous (if perhaps unnecessarily cautious).

Let’s assume that ->fromUtma does what it advertises. In that case, the utility of this script is easy to see — in the example I reference in my post, when a client is hot-linking to their own PDFs from inside an email campaign, it might still be possible to see if those visitors were new or return visitors. If your needs are specifically related to PDFs linked from search engine results pages, as you and Markus have both suggested, I could imagine someone modifying the script or adding an additional .htaccess rewritecond to only fire when the referrer is google.com, etc.
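
For the “modify the script” route, a minimal sketch of such a referrer check might look like this.  The function name is_search_referrer and the default domain list are assumptions of mine for illustration; country domains such as google.de would need their own entries.

```php
<?php
// Hypothetical helper (not part of php-ga): true when the referrer's host
// is one of the given domains or a subdomain of one (e.g. www.google.com).
function is_search_referrer($referrer, $domains = array('google.com'))
{
    if (empty($referrer)) {
        return false; // direct hits and many email clients send no referrer
    }

    // parse_url() returns null or false when no host can be extracted
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;
    }
    $host = strtolower($host);

    foreach ($domains as $domain) {
        $suffix = '.' . $domain;
        // Match the bare domain, or any host ending in ".<domain>"
        if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}
```

In download.php, the trackPageview() call could then be wrapped in a check like if (is_search_referrer(isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '')), so the PDF is still served either way but only search visits are recorded.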

Thanks also for the comment about GA mobile processing. I had thought about this briefly, but realized that most readers probably wanted a solution as close to drop-and-go as possible, which meant: keep PDFs where they are, and keep existing tracking code as-is. This solution attempts to do those things.

Alex

Hi Alex,

Thanks for the reply. Yes, indeed there is great utility when server-side tracking is tied to the proper session and not as much if sessions are broken (which is why I brought up my question). We’ve been doing some session-level tracking with phone calls where the PBX sends a server-side hit; it is quite cool to see what actual actions the user took on the site before making their call. (Cool for us analytics geeks, that is). I do like the idea of tracking hot-linking from emails; that is a good point.

My experience with this server-side tracking in GA is that sessions are indeed broken unless processed via the mobile pipeline. While that doesn’t allow for a “drop-and-go” solution per se, the “heavy lifting” isn’t terrible. Still interested in hearing if that is validated by others, but for now happy to share my experience.

Measurement protocol ultimately will be much cleaner as sessions are calculated server-side, all one needs to pass is the clientId.

Thanks again for the post. You guys at LunaMetrics have consistently been publishing good stuff for years. +1

Yehoshua

I’m wondering if you mean that we should create a new PROPERTY in Google Analytics? I’m assuming this is the case, cause what we really want is a new property ID to send this data to (UA-XXXXXXX-3 for example). Or am I wrong?

Alex Moore says:

Yes, Paul. I originally meant to say you should create a new property for this server-side data. I have updated my post now. Thank you!

Kevin says:

Hi Alex, I would like to try this on a Windows server. Do you know what I would need to do for step 3, as there is no .htaccess file?

Alex Moore says:

Hi Kevin,

I don’t know for sure, as I have little experience with Windows servers, but you should have a web.config file in your root directory. Open that up and insert something like the following:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <rule name="PDF rewrite" stopProcessing="true">
          <match url="^.+\.pdf$" />
          <conditions>
            <add input="{HTTP_USER_AGENT}" pattern="LunaMetrics123" negate="true" />
          </conditions>
          <action type="Rewrite" url="/download.php" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>

Please post back if this works for you, as I’m sure many others would be interested as well! Thanks.

Stephane says:

Hi Alex,
When you say “create a new property”, you mean having two properties for the same website? I thought it was not possible!? When you look at Google’s documentation, they say “Installing multiple instances of the Google Analytics Tracking code on a single web page is not a supported implementation”.
So how can you manage to have a second property?
Thanks

Alex Moore says:

Stephane, actually the quote that you’re including here from the GA documentation refers to the instances of tracking code that appear in the source code of a webpage. I agree that you wouldn’t want to include more than one tracking code on the webpage, but the script that I’m providing is all server-side PHP script, and as such it exists independently of the JavaScript tracking code you refer to in your comment.

Until we all learn exactly what session data is preserved between server-side and JavaScript implementations, I’d suggest that you create a separate property and set that UA-id in the $GA_ACCOUNT variable (line 11) above. You won’t be duplicating tracking code in the HTML of your webpage, and your old data will remain unaffected. Hope that helps. Thanks!

Stephane says:

Ok, thanks for your answer ;)
I tried, following your explanations but something must be missing …
Well, first I created a new GA-ID, and then added it, along with my domain name, to the “download.php” file. I put that file in the root directory, as well as the file autoload.php and the GoogleAnalytics folder.
Then I modified my .htaccess file and added the lines you specified (I have PHP Version 5.3.16). I uploaded it and then tested by downloading a PDF file from a website other than mine (with the GA-ID). Then I looked at GA but … nothing!?

Marc says:

Cant wait to try this out on my unix sites. Grabbing the log file hits to PDF’s and then merging the Analytics data in excel is a pain. I have a bunch of clients whose top landing pages are PDF’s so organic traffic is not represented well by Analytics stats alone.

Jay says:

I also have followed these steps and I’m just not having any luck.

I’m wondering if it has to do with the htaccess file.

the last line:
# … rewrite all requests for PDFs to our PHP script
RewriteRule ^.+\.pdf$ gasandbox/download\.php [L,NC]

I put download.php in the root like you stated. When accessing my direct link to the my pdf I get an error.

I changed that line to:
RewriteRule ^.+\.pdf$ gasandbox/download\.php [L,NC]

Now when accessing the PDF it asked me to download. I downloaded it, checked my GA account in real time, and still 0 visits. I tested the GA tracking ID with a static page to make sure it was working and it is working properly.

Alex Moore says:

Jay, thank you for pointing out the fact that the line in .htaccess you are quoting should actually point to download.php in the ROOT, which is what the instructions said to do after all! My mistake.

The line you are quoting in .htaccess should actually say:

RewriteRule ^.+\.pdf$ /download\.php [L,NC]

Give that a try, and let me know if that fixes your issue. Thanks!

Jay says:

Hey Alex, thanks for the quick response. Still no luck though.
Does it matter if i’m putting these files in a sub-directory of my domain?

so lets say domain is
google.com/dev/

Should I still put all the files (htaccess, download.php, GoogleAnalytics folder etc…) into the folder “dev”?
Do I have to alter the htaccess file?

Thanks again

James says:

This is a really interesting option, and thank you for spreading the word, but I am having trouble. First, the hit never shows up in GA, and second, the PDF is a corrupt file.

I presume this is why the hit isnt showing up…
[Fri Jun 28 14:33:16 2013] [error] [client 128.135.215.18] PHP Fatal error: Uncaught exception ‘UnitedPrototype\\GoogleAnalytics\\Exception’ with message ‘Session::fromUtmb(): The given “__utmb” cookie value is invalid.’ in /opt/sge/html/GoogleAnalytics/Tracker.php:337\nStack trace:\n#0 /opt/sge/html/GoogleAnalytics/Session.php(90): UnitedPrototype\\GoogleAnalytics\\Tracker::_raiseError(‘The given “__ut…’, ‘UnitedPrototype…’)\n#1 /opt/sge/html/download.php(27): UnitedPrototype\\GoogleAnalytics\\Session->fromUtmb(NULL)\n#2 {main}\n thrown in /opt/sge/html/GoogleAnalytics/Tracker.php on line 337

Not really sure about the corrupted PDF, but one thing at a time I suppose.

Jared says:

I have implemented the code, but am getting an error

Parse error: syntax error, unexpected T_STRING, expecting T_CONSTANT_ENCAPSED_STRING or ‘(‘ in /home/ngjge/public_html/download.php on line 9

Any help or pointers would be greatly appreciated…

Jared says:

Line 9 of the download php file is:

use UnitedPrototype\GoogleAnalytics;

Ian Caldwell says:

Any reports about the server-side PHP causing issues for Firefox? Works great, finally tracking PDF views correctly, but I am getting reports about corrupt PDFs from users of Firefox… I haven’t been able to replicate the issue myself, but there do seem to be past issues with Firefox, PHP and caching. Any tweaks to the code that could help avoid this issue?

Hi Alex Moore, excellent and very helpful article.

I do have only a question. If I want track other downloads (e.g., ZIP), I only have to:
1. change download.php line 8 to:
header("Content-type: application/x-download");
2. create one more line in .htaccess:
RewriteRule ^.+\.zip$ /download\.php [L,NC]

Is that all?

Hi Alex Moore, it’s not working with ZIP files. I did what I described in my last post, but it downloads a corrupt file. I guess the problem is in cURL. Can you help me?

Kevin says:

Hi Alex, we implemented this today on a windows server following your instructions and so far seems to run like a dream. Thanks so much for the help.

Kevin

Kevin says:

Hi Alex, I spoke too soon. We are having a problem with some PDFs not displaying. It seems to be related to PDFs served over HTTPS or PDFs with a password on them. Do you have any idea why, or a way around this? We also tried adding a web.config file to a single directory that is not served over HTTPS, rather than to the root, and none of the PDFs in this directory would then display.

Jesse says:

Thanks a whole lot! This makes it possible for me to track when my academic journal articles I have hosted on my website are accessed through Google Scholar.

I did change the php download script to register an event when a pdf is requested, rather than register as a pageview (since I already had an event when I clicked on a pdf from my webpage). Instead of
$page = new GoogleAnalytics\Page($filename);
$page->setTitle($filename);
$page->setReferrer($_SERVER['HTTP_REFERER']);
$tracker->trackPageview($page, $session, $visitor);

I have
$event = new GoogleAnalytics\Event();
$event->setCategory("DownloadPDF");
$event->setAction("requestPDF");
$event->setLabel($filename);
$tracker->trackEvent($event, $session, $visitor);

The only problem I have is that the events are being reported from my webhost’s IP address (so it always registers as being from Indiana instead of its true location). Not a big problem, but have you any idea if there is a way to fix that?

Sebastian Richter says:

Same here. It’s not working with ZIP files. It downloads a corrupt file. Any idea?

Rafael says:

I am encountering the same error, but I am trying with an mp3 file. It will download and/or stream for a short period. And I end up with an incomplete/corrupted file.

Eugeny says:

Thanks for sharing, but your code does not work when cookies are unavailable…
To fix this, you need to add something like this:

if(!empty($_COOKIE['__utma'])) $visitor->fromUtma($_COOKIE['__utma']);

Kib says:

A much nicer approach is to use Apache logging redirection to redirect the logging of a normal download to Google Analytics.

Gabriel Androczky says:

Hello there,

Very nice idea.
However, I also ran into errors. The following modifications are suggested:

$visitor->fromUtma($_COOKIE['__utma']);
should be changed to:
if(!empty($_COOKIE['__utma'])) $visitor->fromUtma($_COOKIE['__utma']);

$session->fromUtmb($_COOKIE['__utmb']);
should be changed to:
if(!empty($_COOKIE['__utmb'])) $session->fromUtmb($_COOKIE['__utmb']);

and

$page->setReferrer($_SERVER['HTTP_REFERER']);
should be changed to:
if(!empty($_SERVER['HTTP_REFERER'])) $page->setReferrer($_SERVER['HTTP_REFERER']);

This stopped the errors for me. Hope it helps someone. :)

Cheers,

Gabriel

Tor says:

I’ve been trying to implement this.

Out of the box, it worked perfectly in Firefox and Safari. However, Internet Explorer wouldn’t open the PDF files, giving a “file does not begin with %PDF” error. Even so, it still recorded the download in Google Analytics.

In researching this, I came across a web page that noted you couldn’t handle cookies before a file download or it would break in IE. On a whim, I reorganized the download.php file so all the CURL stuff came before the GA stuff. This continued to work in Firefox and Safari. And it got rid of the PDF file error in IE… but then IE wouldn’t transmit the download to GA. It seems like IE killed the PHP process before it could get to the GA stuff.

Any thoughts on a workaround?
