Languages Report in Google Analytics
February 12, 2013
The Languages report is among the most cryptic of the reports in Google Analytics. It looks like you need a secret decoder ring to figure out what it’s telling you. Here’s some guidance on what those codes are and what they mean.
What are they?
These language codes represent a language and optional country variant. (We’ll look at where GA gets these momentarily.)
The codes aren’t specific to Google Analytics; in fact, they’re based on two ISO standard specifications:
- ISO 639 specifies 2-letter and 3-letter codes for languages. The Library of Congress publishes a list of these codes.
- ISO 3166 specifies 2-letter and 3-letter codes for countries, as well as 3-digit numeric codes.
In most cases, languages in the Google Analytics report have a 2-letter language code (for example, “en” for English).
That may be all you see. Optionally, though, there may also be a 2-letter country code, with a hyphen separating the two parts (for example, “en-us” for US English).
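As a minimal sketch of that structure, the Python function below splits a label into its language and optional country parts. The function name split_language_code is our own, chosen for illustration, and the sketch assumes a well-formed label with a hyphen separator.

```python
def split_language_code(code):
    """Split a label such as 'en-us' into (language, country).

    The country is None when the label holds only a language, as in 'en'.
    """
    language, _, country = code.strip().lower().partition("-")
    return language, (country or None)


print(split_language_code("en"))     # ('en', None)
print(split_language_code("en-us"))  # ('en', 'us')
```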
Where does GA get them?
Google Analytics takes these values from the web browsers of visitors. Language is actually a user-selectable setting in most web browsers, generally defaulting to the language of the operating system. However, users can change the setting to reflect their preferences. Here’s an example of the setting in Chrome:
You can actually set several languages and rank them in order of preference. This preference can be used by sites to automatically choose a localized version through a process called content negotiation, giving you a translated version of the site in your most preferred language. Google Analytics simply reports the first preference in the browser.
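To make content negotiation a little more concrete, here is a rough Python sketch of how a site might read a ranked language list and choose a localized version. It assumes the preferences arrive as an HTTP Accept-Language header with optional q weights; the function names, the supported set, and the default value are all hypothetical, and real servers apply more detailed matching rules than this.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # a tag without an explicit q-value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # unreadable weight: treat as least preferred
        # Sort by descending quality, then by original position in the header.
        ranked.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(ranked)]


def choose_language(header, supported, default="en"):
    """Pick the first preferred language that the (hypothetical) site offers."""
    for tag in parse_accept_language(header):
        if tag in supported:
            return tag
        language = tag.split("-")[0]
        if language in supported:
            return language
    return default
```

For example, choose_language("fr-ca,fr;q=0.8,en;q=0.5", {"en", "fr"}) returns "fr", while the first entry of parse_accept_language for the same header, "fr-ca", is the kind of first-preference value that ends up in the Languages report.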
Usage of country codes
In our experience, usage of country codes in the data in the Languages report varies widely. To quantify this, we took a sample of a recent 30-day period for a global organization with a website localized into several languages, containing data on approximately 1.4 million visits. Although every site will differ based on its audience, this sample gives us a wide set on which to make generalizations.
We found that usage of country codes does indeed vary by language. Among the top ten languages for this site, English and Portuguese usually included a country designator such as en-us or pt-br (95.0% and 92.5% of the time, respectively), and Chinese virtually always did (99.9%). Other languages rarely included one, appearing simply as fr or es (French only 12.7% of the time, and Spanish, Arabic, Russian, and Japanese all less than a third of the time).
Overall, the median rate of country code usage across languages was only 14.8%. In general, less populous languages rarely used country codes (with some exceptions). The overall average, however, was 71.8% because of the heavy representation of English and Chinese in this sample. How often country codes are present for any given site will vary with the language composition of its visitors.
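The gap between the median and the overall average is easier to see with a small calculation. The Python sketch below uses toy numbers invented purely for illustration (they are not our sample data) and assumes that the overall average is weighted by visits, while the median treats each language equally.

```python
from statistics import median

# Toy numbers invented for illustration; they are NOT the sample described above.
# language -> (total visits, visits whose label included a country code)
toy_visits = {
    "en": (1000, 950),
    "zh": (300, 299),
    "fr": (100, 12),
    "es": (100, 20),
    "ru": (50, 5),
}


def usage_rates(visits):
    """Share of each language's visits that carried a country code."""
    return {lang: with_cc / total for lang, (total, with_cc) in visits.items()}


def median_rate(visits):
    """Median across languages: every language counts once."""
    return median(usage_rates(visits).values())


def overall_rate(visits):
    """Average across visits: large languages dominate the result."""
    total = sum(t for t, _ in visits.values())
    with_cc = sum(w for _, w in visits.values())
    return with_cc / total


print(median_rate(toy_visits))   # 0.2
print(overall_rate(toy_visits) > median_rate(toy_visits))  # True
```

With these toy figures, two large languages that nearly always carry a country code pull the visit-weighted average far above the per-language median, which is the same pattern we saw in the sample.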
If you page through your Languages report, you may find unusual language labels that don’t match the patterns above. Here’s what we found in our sample.
Missing data: (not set). Data for language was missing in very few visits (0.02%). The primary culprits were Opera Mini and certain BlackBerry browsers.
Strange long hex labels. In 0.01% of cases, we saw strange long hexadecimal numbers: *30775594307752e1307755a430775578307753f0, for example. This stems from certain BlackBerry browsers.
Misformatted labels. Labels were misformatted in a variety of ways, including the following:
- using a separating hyphen even when no country code is included: en-
- using a separating underscore instead of a hyphen: en_us
- accidental capture of additional parameters: en;q=1.0
- inclusion of a script code (here latn, for Latin script) or other information: sr-latn-rs
The underscore was the most common of these, occurring in 0.14% of visits, and came from certain BlackBerry browsers. The others are rare.
Invalid codes. We found a handful of codes that just don’t seem to exist: c, qcv, etc.
Undefined country code. There was one code with an anomalous country part: es-419. Although countries can be specified by a 3-digit numerical code under ISO 3166, 419 doesn’t correspond to any country. It is instead the UN M.49 region code for Latin America and the Caribbean, so es-419 denotes Latin American Spanish rather than the Spanish of a single country. This accounted for a significant number of visits (0.95%), and in our sample it came from a particular version of Chrome.
Three-letter language codes. The vast majority of languages we found in this sample of data were represented by 2-letter codes. However, ISO 639 allows for an expanded set of 3-letter codes for languages that do not have 2-letter representations. The only 3-letter language code we saw with any frequency was fil (Filipino) at 0.14%. Still, this is a sign that we need to be careful in filtering the language field, because 3-letter codes may occur and their frequency could vary with the site’s audience.
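If you export the report and want to tidy up these oddities yourself, the following Python sketch shows one way to do it. It is an illustration under our own assumptions, not a Google Analytics feature: the function name normalize_language and the pattern for an acceptable label are hypothetical, and the check looks only at the shape of a label, so a nonexistent code such as qcv still passes.

```python
import re

# Assumed shape of a usable label: a 2- or 3-letter language code followed by
# optional subtags such as a country ("us"), a script ("latn"), or a region ("419").
_LABEL = re.compile(r"([a-z]{2,3})((?:-[a-z0-9]{2,8})*)")


def normalize_language(raw):
    """Return (language, subtags) for a report label, or None if unusable."""
    label = raw.strip().lower()
    if label in ("", "(not set)"):
        return None
    label = label.split(";", 1)[0]   # drop captured parameters: en;q=1.0 -> en
    label = label.replace("_", "-")  # en_us -> en-us
    label = label.rstrip("-")        # en- -> en
    match = _LABEL.fullmatch(label)
    if match is None:
        return None                  # hex strings, one-letter codes, and so on
    language, rest = match.groups()
    subtags = rest.strip("-").split("-") if rest else []
    return language, subtags


print(normalize_language("en_us"))       # ('en', ['us'])
print(normalize_language("en;q=1.0"))    # ('en', [])
print(normalize_language("sr-latn-rs"))  # ('sr', ['latn', 'rs'])
print(normalize_language("(not set)"))   # None
```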
Filtering and segmenting language data
In most cases, if we’re interested in language data, we’re interested only in the language code and don’t care so much about the country. After all, we have the Locations report to tell us where a visitor is physically located, and the differences between language variants in two countries are usually small (en-us vs. en-gb, for example). (That’s not always the case, however; Chinese variants may not be mutually intelligible.)
To create filters or advanced segments, you can use “begins with” as a matching criterion:
(Note that you don’t want to include the hyphen, because sometimes it may not be there (no country code), and sometimes it may be an underscore instead.)
However, based on our findings above, we recommend instead that you use a regular expression to match the two-letter language code followed either by the end of the string or by a hyphen or underscore:
This ensures that you don’t run into any problems with 3-letter language codes. (For example, begins with “fi” for Finnish would also include “fil” for Filipino. The regular expression excludes that.)
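A pattern of the form ^fi($|[-_]) fits this description; treat it as our assumption of a suitable expression, and test it against your own data before relying on it in a filter or segment. The Python sketch below compares it with a plain “begins with” match on a few made-up labels. The helper name language_pattern is hypothetical, and the regular expression dialect in Google Analytics may differ in small ways from Python’s.

```python
import re


def language_pattern(code):
    """Regex for a language code that ends the label or is followed by - or _."""
    return re.compile(r"^" + re.escape(code) + r"(?:$|[-_])", re.IGNORECASE)


# Made-up labels for illustration only.
labels = ["fi", "fi-fi", "fi_fi", "fil", "fil-ph", "fr"]

finnish = language_pattern("fi")
matched_by_regex = [label for label in labels if finnish.search(label)]
matched_by_prefix = [label for label in labels if label.startswith("fi")]

print(matched_by_regex)   # ['fi', 'fi-fi', 'fi_fi']
print(matched_by_prefix)  # ['fi', 'fi-fi', 'fi_fi', 'fil', 'fil-ph']
```

The prefix match pulls in the Filipino labels, while the regular expression keeps only Finnish, including the underscore variant.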