Data sampling in Google Analytics (GA) has always taken place, and occurs by default – most commonly when running a report containing more than 500,000 visits and requesting advanced segments or a combination of dimensions which GA doesn't store pre-calculated by default: particularly any custom segmentation, but also commonly used reports such as the keyword and top content reports.
Another less common trigger for sampling is any query that exceeds 1,000,000 unique dimension values (e.g. keywords, URLs, etc). In this case, GA serves up a sampled report containing 1,000,000 unique URLs (or any other dimension value) for a given request, divided by the number of days requested; for example, a three-month timeframe would give you 1,000,000/90, or roughly 11,111 unique URLs per day. Even if all you're interested in is a single URL or keyword, with a number of visits far below these thresholds, the data may still be sampled, because GA has to process all visits to the site in the given date range.

A recent addition to the GA interface is the ability to adjust the sample size from the default 250,000 visits up to 500,000 visits for greater accuracy, or down to 1,000 for faster processing of results but lower precision; a smaller sample size means a greater margin of error. Google used to return a confidence interval (e.g. ±5 per cent) against each sampled metric as a statistical measure of the accuracy of sampling, but this value has long since been removed as "it was thought to be confusing".

So, how accurate is the sampled data? A quick bit of analysis on one website with 1.5 to 2 million visits per month shows that… well, it depends. Below, I've compared the sampled data with the true metrics for two PPC campaigns for the first three months of the year, the latter determined by exporting unsampled data to Excel in nine to 12-day batches and recombining them. For a campaign with high traffic volume (12 per cent of all site traffic), the results weren't bad even at the default sampling level, overestimating the number of visits by just one per cent, and sales and revenue by five per cent. When my sampling preference was switched to maximum precision, the margin of error decreased even further. Using this data to optimise a campaign should be okay. However, for a low-volume campaign (0.04 per cent of total traffic), the results were way off.
Even at the highest sample size, the visit count was still underestimated by almost four per cent. More seriously, not one of the nine sales in this three-month period was picked up. Using this data to make decisions on this campaign's future would be disastrous.

Sampling is very accurate for top-level metrics such as total visits and total page views, and for shorter date ranges or lower-traffic sites it may not be a problem at all. However, if you have a high-traffic website and a desire for granular data, then you need exact metrics rather than a statistical sample. There are a few options to consider when trying to reduce Google's use of sampling, if not eliminate it entirely:
- Move the precision slider towards the 'higher precision' end of the scale. If total site visits are less than 500k in the selected date range, this alone could be enough to eliminate sampling
- Select date ranges with fewer than 500,000 visits. Data from successive periods can be exported and recombined in Excel if necessary - but this is time-consuming and impractical for sites with very high traffic volumes
- Try to use the default reports rather than advanced custom segments whenever possible
- Create multiple profiles that each track a smaller part of your site, e.g. PPC-only traffic. Sampling occurs at the profile level, so within these profiles you won't hit the 500k limit as quickly as in the main profile
- Google Analytics Premium allows access to unsampled data for those willing to pay for it
- Grab the data from GA using the API, or a third-party tool that will do it for you, such as Excellent Analytics, a free and easy-to-use add-in for Excel. Apparently some of the API data is sampled depending on the query, but in my own testing discussed above this was not the case: Excellent Analytics' reported figures agreed 100 per cent with the true values.
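The batch-and-recombine approach behind several of the options above can be sketched in a few lines. This is a minimal illustration, not GA API code: `date_batches` and `recombine` are hypothetical helpers, and the per-batch reports themselves would come from a manual export or an API client such as Excellent Analytics. The idea is simply that each window is kept short enough to stay under the sampling threshold, so every batch is unsampled.

```python
from datetime import date, timedelta

def date_batches(start, end, batch_days=9):
    """Split [start, end] into consecutive windows of at most batch_days days."""
    batches = []
    cursor = start
    while cursor <= end:
        batch_end = min(cursor + timedelta(days=batch_days - 1), end)
        batches.append((cursor, batch_end))
        cursor = batch_end + timedelta(days=1)
    return batches

def recombine(batch_reports):
    """Sum additive metrics (visits, sales, revenue) across batch reports.

    Caveat: non-additive metrics such as unique visitors cannot be
    recombined this way, since the same visitor may appear in several
    batches.
    """
    totals = {}
    for report in batch_reports:
        for metric, value in report.items():
            totals[metric] = totals.get(metric, 0) + value
    return totals

# Example: a first quarter (year is illustrative) in nine-day windows
batches = date_batches(date(2012, 1, 1), date(2012, 3, 31))
quarter_totals = recombine([
    {"visits": 10, "sales": 1},   # placeholder per-batch figures
    {"visits": 5},
])
```

Note the caveat in `recombine`: only strictly additive metrics survive this treatment, which is one reason the manual Excel route works well for visits, sales and revenue but not for de-duplicated counts.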