18 November 2019 / Opinion

How to tell if your A/B tests are statistically significant (and what to do if they're not)

Alice Clayton

A/B testing and experimentation in CRO & UX projects are more common than ever, but many marketers are still lacking in statistical know-how.

Understanding A/B testing statistics is vital for running tests correctly; otherwise, errors and unreliable results can lead to months of testing with no improvement in conversion rate (n.b. I’ve included some links to further in-depth reading on A/B testing statistics at the end of this article).

So what is statistical significance? And why is it important in A/B testing?

Optimizely states, “Statistical significance is the likelihood that the difference in conversion rates between a given variation and the baseline is not due to random chance.”

The higher the significance level, the more confident you can be that your results are real and not due to a random error.

If your A/B test has a significance level of 95%, this means you can be 95% confident your results are real and not down to chance. It also means there is a 5% chance you are wrong.

Reaching statistical significance in A/B testing is important because when making business decisions based on the results of a test, you want to be sure that the results are actually real.

How to calculate whether your results are statistically significant

A/B testing tools will often tell you once your results have reached statistical significance. However, be wary of tools declaring statistical significance very early on - A/B test results can actually fluctuate between significance and non-significance throughout the course of the test.

That is why, before ending your test, you should ensure it has run for full weeks and for at least two business cycles (a minimum of two weeks). This gives you enough data and avoids skewed results, as conversions can fluctuate throughout the week.

You can also use tools such as this one to calculate whether your results are statistically significant. Input the number of days your test has been running for, and how many sessions/users and conversions each variant received.

The tool will then calculate the statistical significance, using both Bayesian and Frequentist methods (you can read more about that here).
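
If you'd rather sanity-check the numbers yourself, the sketch below shows both approaches on made-up counts: a Frequentist two-proportion z-test and a simple Bayesian comparison of Beta posteriors. The figures and the 95% threshold are illustrative assumptions, not a substitute for your testing tool's own calculation.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conversions = np.array([310, 356])    # control, variation (made-up figures)
sessions = np.array([12000, 12050])

# Frequentist: two-sided z-test for the difference in conversion rates
z_stat, p_value = proportions_ztest(conversions, sessions)
print(f"p-value: {p_value:.4f} (significant at the 95% level if below 0.05)")

# Bayesian: probability that the variation beats the control,
# using Beta(1, 1) priors updated with the observed counts
rng = np.random.default_rng(42)
control = rng.beta(1 + conversions[0], 1 + sessions[0] - conversions[0], 100_000)
variation = rng.beta(1 + conversions[1], 1 + sessions[1] - conversions[1], 100_000)
print(f"P(variation beats control) = {(variation > control).mean():.1%}")
```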

Why hasn’t my test reached statistical significance?

Statistically significant results depend on two factors: 1) sample size (traffic levels) and 2) effect size (the difference between conversion rates).

If your results are not statistically significant, it could be that:

1. Your sample size is not large enough

The larger your sample size, the more confident you can be in the results of your experiment.

The higher your traffic levels, the quicker you will have enough data to determine whether there is a statistically significant difference in conversion rate between the control and the variation.

If your site has low traffic levels, you’ll need to run your test for longer to reach a large enough sample size. A sample size that is too small will lead to sampling errors.

2. Your effect size is too small

If your effect size is very small (such as a < 1% increase in conversion rate), you’ll need a very large sample size to determine whether the result is significant.

The larger the effect, the smaller the sample size needed. It’s possible that testing only small changes between variations (such as minor copy amends) doesn’t have a big enough impact on conversion rates, so the effect is too small to be detected.
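
To see how sharply the required sample size grows as the effect shrinks, here is a rough Python sketch using statsmodels power analysis. The 3% baseline conversion rate, 95% significance level and 80% power are assumptions chosen purely for illustration.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03            # assumed 3% baseline conversion rate
analysis = NormalIndPower()

for relative_lift in (0.20, 0.10, 0.05, 0.01):
    variant_rate = baseline * (1 + relative_lift)
    effect = proportion_effectsize(variant_rate, baseline)   # Cohen's h
    n = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.8)
    print(f"{relative_lift:.0%} relative lift -> ~{round(n):,} visitors per variant")
```

On these assumptions, detecting a 1% relative lift needs orders of magnitude more visitors per variant than detecting a 20% lift, which is why small copy tweaks rarely reach significance on lower-traffic sites.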

What should you do if you haven’t reached statistical significance?

1. Run your test for longer

If you suspect your test has not reached statistical significance due to insufficient sample size, you could try running your test for a few more weeks.

A sample size calculator such as this one, or the ConversionXL test analysis calculator, will help you understand how long you need to run your test for the results to reach statistical significance.

Be wary of running your tests for too long, however, as you could risk sample pollution, which is where a test runs for so long that external factors begin to influence the results (such as holidays, technical issues, promotions).

2. Dig deeper into your results

We strongly recommend integrating your testing tool with your analytics tool, such as Google Analytics. This will allow you to conduct an in-depth analysis of your test data, which could reveal unexpected results.

Segmentation can be used to discover positive, significant results where previously there were none.

For example, across all devices, it could appear that you have no significant results, but when you segment your results by device you could have statistically significant results for mobile but not for desktop or tablet, which is affecting the overall results.

It’s important to note with segmentation, however, that you still need a large enough sample size for each segment within your sample for the results to be statistically significant.
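
As an illustration of what that segment-level analysis might look like, here is a hedged Python sketch. The CSV file name and column names are assumptions about how you might export your data, and each per-segment test still needs a large enough sample of its own.

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Assumed export: one row per device/variant with columns
# device, variant, sessions, conversions (the file name is hypothetical)
df = pd.read_csv("test_results_by_device.csv")

for device, segment in df.groupby("device"):
    control = segment[segment["variant"] == "control"].iloc[0]
    variation = segment[segment["variant"] == "variation"].iloc[0]
    _, p = proportions_ztest(
        [control["conversions"], variation["conversions"]],
        [control["sessions"], variation["sessions"]],
    )
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{device}: p = {p:.3f} ({verdict} at the 95% level)")
```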

You should also look at micro-conversions and engagement metrics, rather than just macro-conversions.

Whilst you may not see a significant increase in transactions, if you look at other metrics you could see there has been a significant reduction in bounce rate or a significant increase in progression through the checkout stages.

3. Utilise other tools for further information

You can look to other sources of data to get more information on what happened in your test, and potentially indicate why your results are not statistically significant.

Heatmaps can tell you how user behaviour differed between variations (if it differed at all).

Some testing tools already include heatmaps, but you can also integrate tools such as Hotjar, which has the added benefit of providing session recordings too.

If you’re really struggling to understand how users interacted with your test, an online user testing tool such as UsabilityHub can allow you to gain some insight.

Using preference tests, you can show users static images of your variations, allow them to pick their preference, and ask them why they made this choice.

Obviously this has many drawbacks, including that these users are not actual users of your site (although you can use demographic targeting to get as close as possible), they haven’t experienced the journey leading up to the test, and viewing static images is not the same as experiencing the test on the live site.

However, it can still provide some extra insight that could help you understand your test results.

So, how can you make sure your tests are statistically significant?

1. Calculate your sample size before running the test

Before you begin testing, you should calculate a sample size for your test. You can use tools such as this one to calculate your required sample size.

You’ll need to enter your current conversion rate and the Minimum Detectable Effect (MDE) you wish to detect (for example, you wish to see at least a 10% improvement in conversion rate). The smaller the MDE, the larger the sample size per variant you will need.

Once you have your required sample size, you can use data on how many weekly visitors the area of the site you'll be testing receives to work out how long to run your test. The lower your weekly traffic levels, the longer you will need to test for.

By calculating sample size in advance, you will know how long you need to run your test for to reach statistical significance.
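
As a sketch of that pre-test calculation, the Python below derives a required sample size from an assumed 2.5% baseline conversion rate and a 10% relative MDE, then converts it into a test duration using an assumed 9,000 weekly visitors to the tested page and an even 50/50 split between two variants. Every figure is illustrative.

```python
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cr = 0.025        # assumed current conversion rate
relative_mde = 0.10        # assumed minimum detectable effect (10% relative lift)
weekly_visitors = 9_000    # assumed weekly traffic to the tested page

effect = proportion_effectsize(baseline_cr * (1 + relative_mde), baseline_cr)
n_per_variant = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)

total_needed = 2 * n_per_variant          # two variants at a 50/50 split
weeks = math.ceil(total_needed / weekly_visitors)
print(f"~{round(n_per_variant):,} visitors per variant, "
      f"so roughly {weeks} full week(s) of traffic")
```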

2. Run tests on higher-traffic or higher-impact parts of the site

If you struggle to reach your required sample size without running tests for months on end, try focusing on the parts of your site with the highest traffic, or where a change would have the biggest impact on conversion rate.

This could be an area of the site that needs serious improvement, or high value areas such as top-selling products.

3. Run tests that will have a bigger impact

Make sure you are testing meaningful changes in order to see significant results, particularly if your site does not have a lot of traffic.

Rather than minor amends, such as label changes or copy tweaks, test large, noticeable changes such as different offers, page layout redesigns or new functionality.

4. Choose your tests based on insight and data

Don’t just test something because you saw it in an article about ‘A/B tests guaranteed to make you lots of money’. What works on one website won’t always work on yours.

Conducting regular insight work such as heuristic reviews, analytics data reviews, user research and behaviour analysis will provide you with testing ideas that are supported by solid evidence.

These tests are more likely to have a real impact on your conversion rates than something randomly chosen that doesn’t actually apply to your industry or users.

To summarise…

In an ideal world, all A/B tests would have positive, significant results, where the variation is clearly the winner.

Sadly this isn’t always the case, but taking measures such as ensuring a large enough sample size, running tests for long enough and using data to support test ideas that will have a large impact will increase the robustness of your testing programme.

Further reading:

A/B testing statistics & statistical significance: