
Languages Report in Google Analytics

The Languages report is among the most cryptic of the reports in Google Analytics. It looks like you need a secret decoder ring to figure out what it’s telling you. Here’s some guidance on what those codes are and what they mean.


What are they?

These language codes represent a language and optional country variant. (We’ll look at where GA gets these momentarily.)


The codes aren’t specific to Google Analytics; in fact, they’re based on two ISO standards: ISO 639 for language codes and ISO 3166 for country codes.

In most cases, languages in the Google Analytics report have a 2-letter language code (for example, “en” for English).

That may be all you see. Optionally, though, there may also be a 2-letter country code, with a hyphen separating the two parts (for example, “en-us” for US English).
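As a minimal sketch of the format just described, the following Python splits a well-formed label into its language and optional country parts. The names `LanguageLabel` and `split_label` are hypothetical, chosen for this illustration, and the function assumes the hyphen-separated form shown above.

```python
from typing import NamedTuple, Optional


class LanguageLabel(NamedTuple):
    language: str           # e.g. "en"
    country: Optional[str]  # e.g. "us", or None when no country part is present


def split_label(label: str) -> LanguageLabel:
    """Split a well-formed label such as "en" or "en-us" into its parts."""
    # partition() returns (before, separator, after); the separator and the
    # remainder are empty strings when the label contains no hyphen.
    language, sep, country = label.lower().partition("-")
    return LanguageLabel(language, country if sep and country else None)


print(split_label("en"))
print(split_label("en-us"))
```

Lowercasing first means labels that differ only in case (such as `en-US` and `en-us`) end up in the same bucket.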

Where does GA get them?

Google Analytics takes these values from the web browsers of visitors. Language is actually a user-selectable setting in most web browsers, generally defaulting to the language of the operating system. However, users can change the setting to reflect their preferences. Here’s an example of the setting in Chrome:

Chrome Language Preference

You can actually set several languages and rank them in order of preference. This preference can be used by sites to automatically choose a localized version through a process called content negotiation, giving you a translated version of the site in your most preferred language. Google Analytics simply reports the first preference in the browser.
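The ranked list described above is what a browser sends to sites in the `Accept-Language` request header, with optional `q` weights marking lower preferences. The sketch below parses such a header and returns the tags in order of preference; the first element is the "first preference" in the sense used here. The function name `ranked_languages` is hypothetical, and how Google Analytics itself reads the value is not shown in this post, so treat this only as an illustration of the ranking idea.

```python
from typing import List


def ranked_languages(accept_language: str) -> List[str]:
    """Return language tags from an Accept-Language value, most preferred first."""
    ranked = []
    for position, item in enumerate(accept_language.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # default weight when no q-value is given
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unparseable weight as least preferred
        # Sort by descending weight, then by original position in the header.
        ranked.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(ranked)]


preferences = ranked_languages("fr-CA,fr;q=0.9,en;q=0.7")
print(preferences[0])  # the first preference
```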

Usage of country codes

In our experience, usage of country codes in the data in the Languages report varies widely. To quantify this, we took a sample of a recent 30-day period for a global organization with a website localized into several languages, containing data on approximately 1.4 million visits. Although every site will differ based on its audience, this sample gives us a wide set on which to make generalizations.

We found that usage of country codes does indeed vary by language. For the top ten languages for this site, English and Portuguese usually used a country designator like en-us or pt-br (95.0% and 92.5% of the time, respectively), and Chinese virtually always used one (99.9%). Other languages rarely used one, appearing simply as fr or es (French only 12.7% of the time, and Spanish, Arabic, Russian, and Japanese all less than a third of the time).

Usage of Country Codes by Language

Across languages, the median rate of country code usage was only 14.8%. In general, less populous languages rarely used country codes (with some exceptions). The overall average across all visits, however, was 71.8% because of the heavy representation of English and Chinese in this sample. The proportion of visits for which country codes are present on any given site will vary with the language composition of its visitors.
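The gap between a low median and a high overall average comes from weighting: the median counts each language once, while the overall figure counts each visit once. Here is a small sketch of that arithmetic. The languages, visit counts, and rates are invented for illustration and are not the figures from our sample; `median_rate` and `overall_rate` are hypothetical helper names.

```python
from statistics import median

# Hypothetical figures for illustration only; they are NOT the sample's data.
# Each entry: (language, visits, share of visits carrying a country code).
usage = [
    ("lang_a", 700, 0.95),
    ("lang_b", 100, 0.10),
    ("lang_c", 80, 0.15),
    ("lang_d", 70, 0.12),
    ("lang_e", 50, 0.20),
]


def median_rate(rows):
    """Median of the per-language rates: every language counts once."""
    return median(rate for _, _, rate in rows)


def overall_rate(rows):
    """Share of all visits with a country code: every visit counts once."""
    total_visits = sum(visits for _, visits, _ in rows)
    coded_visits = sum(visits * rate for _, visits, rate in rows)
    return coded_visits / total_visits


print(f"median across languages: {median_rate(usage):.1%}")   # 15.0%
print(f"overall across visits:   {overall_rate(usage):.1%}")  # 70.5%
```

One high-traffic language that nearly always carries a country code is enough to pull the overall rate far above the median.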

Data issues

If you page through your Languages report, you may find unusual language labels that don’t match the patterns above. Here’s what we found in our sample.

Missing data: (not set). Data for language was missing in very few visits (0.02%). The primary culprits were Opera Mini and certain BlackBerry browsers.

Strange long hex labels. In 0.01% of cases, we saw strange long hexadecimal numbers: *30775594307752e1307755a430775578307753f0, for example. This stems from certain BlackBerry browsers.

Misformatted labels. Labels were misformatted in a variety of ways, including the following:

  • using a separating hyphen even when no country code is included: en-
  • using a separating underscore instead of a hyphen: en_us
  • accidental capture of additional parameters: en;q=1.0
  • inclusion of a script subtag or other information: sr-latn-rs (latn indicates Latin script)

The underscore was the most common of these, occurring in 0.14% of visits; it came from certain BlackBerry browsers. The others were rare.
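If you export the report and want to tidy these variants yourself, a normalization step along the following lines may help. This is a minimal sketch that handles only the first three patterns listed above; `normalize_label` is a hypothetical name, and labels with extra parts such as sr-latn-rs are left untouched.

```python
def normalize_label(label: str) -> str:
    """Tidy the misformatted variants described above; other labels pass through."""
    cleaned = label.strip().lower()
    cleaned = cleaned.split(";", 1)[0]   # drop captured parameters: "en;q=1.0" -> "en"
    cleaned = cleaned.replace("_", "-")  # underscore separator: "en_us" -> "en-us"
    cleaned = cleaned.rstrip("-")        # dangling separator: "en-" -> "en"
    return cleaned


for raw in ["en-", "en_us", "en;q=1.0", "sr-latn-rs"]:
    print(raw, "->", normalize_label(raw))
```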

Invalid codes. We found a handful of codes that just don’t seem to exist: c, qcv, etc.

Region code in place of a country code. There was one label with an unusual second part: es-419. Although countries can be specified by a 3-digit numerical code under ISO 3166, 419 doesn’t correspond to any country. It is the UN M.49 code for the region Latin America and the Caribbean, so es-419 denotes Latin American Spanish. This accounted for a significant number of visits (0.95%), and in our sample it came from a particular version of Chrome.

Three-letter language codes. The vast majority of languages we found in this sample of data were represented by 2-letter codes. However, ISO 639 allows for an expanded set of 3-letter codes for languages that do not have 2-letter representations. The only 3-letter language code we saw with frequency was fil (Filipino) at 0.14%, but this is an indicator that we need to be careful in filtering the language field because 3-letter languages may occur, and could vary depending on the site’s audience.

Filtering and segmenting language data

In most cases, if we’re interested in language data, we’re interested only in the language code and don’t care so much about the country. After all, we have the Locations report to tell us where a visitor is physically located, and in most cases the differences between language variants in two countries are small (en-us vs. en-gb, for example). (That’s not always the case, however; Chinese variants may not be mutually intelligible.)

To create filters or advanced segments, you can use “begins with” as a matching criterion:

Begins with filter

(Note that you don’t want to include the hyphen, because sometimes it may not be there (no country code), and sometimes it may be an underscore instead.)

However, based on our findings above, we recommend instead that you use a regular expression that matches the two-letter language code followed by either the end of the string or a hyphen or underscore:

Filter by RegEx

Copy-and-paste-able: en($|[-_])

This ensures that you don’t run into any problems with 3-letter language codes. (For example, begins with “fi” for Finnish would also include “fil” for Filipino. The regular expression excludes that.)
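To see the difference outside of Google Analytics, here is a short Python sketch that applies the same pattern to a few labels. It uses `re.match`, which only matches at the start of the string, as a stand-in for matching from the beginning of the label; `matches_language` is a hypothetical helper, and how the Google Analytics interface anchors its own matching is not something this sketch demonstrates.

```python
import re


def matches_language(code: str, label: str) -> bool:
    """True if label is the given language code, alone or followed by - or _."""
    # Same shape as the expression above: the code, then end-of-string
    # or a hyphen/underscore separator.
    pattern = re.compile(re.escape(code) + r"($|[-_])")
    return pattern.match(label.lower()) is not None


for label in ["fi", "fi-fi", "fi_fi", "fil", "fil-ph"]:
    print(label, matches_language("fi", label), label.startswith("fi"))
```

The last two labels are where the approaches differ: a plain "begins with" check accepts fil and fil-ph as Finnish, while the pattern rejects them.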

 

Jonathan Weber


http://www.lunametrics.com/blog/2013/02/12/languages-report-google-analytics/

9 Responses to “Languages Report in Google Analytics”

Great post! One of the things I do is create a “simplified language” advanced filter as follows:
Field A: Language Settings, ^(.+)[\-_]
Output: Language Settings: $A1

This way, I only get the 1st part of the locale (like EN, FR, etc.) instead of EN-US, EN-CA, EN-UK, etc. This gives me a better appreciation of language distribution without having to use advanced segments or filters, and I can always use the demographic location to segment.

Jonathan Weber says:

Stephane – yes, even better to consolidate the data before you see it in the report. Thanks!

NicoDavila says:

I thought that es-419 is about “latin america”. It’s not about any country but it’s about latin america speakers. Am I right?

Jonathan Weber says:

Thanks for the catch on “419”: it’s not in the list of country codes because it’s… not a country! Nevertheless, the usage seems to be unusual; only certain versions of Chrome seem to make use of it.

Jo Woodfine says:

Is there a way to find out what the language code “C” designates? You mention in your post that it is an invalid code. The “C” language is coming in from a not set location (city or country) and I can’t determine anything else. Thanks!

andrew says:

Hi

I have a few sites targeting Italy and France.

For one site targeting Italy, Google Analytics shows traffic from the following languages:

IT
IT-IT

Same thing with the French sites:

FR
FR-FR

So I want to know what the difference is between it and it-it, and between fr and fr-fr.

Please help. I have bookmarked your post and will check back for your reply.

Tks

Rumble says:

Interesting stuff. So I am in Scotland on business, with an American computer, but I live in France and am originally from the UK. My Firefox browser shows en-GB as its preference. I wonder when this was set. But if I fill in a form, they will think I am a UK user, and I am not.
I am interested in targeting expats, so they will probably appear as users from where they were originally from.

Andrew — that was covered in the original post. The country parts are optional and some browsers include them, some don’t. There is no meaningful difference between it and it-it, or fr and fr-fr (except you can be sure fr-fr is not fr-ca).

Rumble — keep in mind we also have the Locations report, which reports where you are. So the combination of language and location may be helpful to you in sorting things out. If you travel about, your location will update (whereas your language most likely will not).

Alex says:

Hi Jonathan,

Thanks for this article, it is very timely.
This regular expression is very handy.

I could create 3 filters for my 3 main languages: Est, Eng and Rus.

However, that gives me results that are separate from each other.

Is there any way to create a custom report that would display, e.g., a pie chart calculated only from the totals of each language filter?

What I want to achieve is to see the ratio, or better said the relationship, between these three languages.
I hope that makes sense :)
Could you suggest where I could learn about it?

Looking forward to hearing from you

Regards.
Alex