Two out of three A/B tests are not valid

I admit that this statement sounds alarming, so first let me reassure you. It isn’t that the concepts are bad or that A/B testing is unreliable in general. The erroneous results can much more often be traced back to something that is actually avoidable: test runtimes that are too short.

Results are subject to great fluctuations right at the start of testing and they only stabilize during the course of time and approach actual conversion rates. If we consider the following example of an A/B test with four variations, we see that the results of the variations converge again after a certain time.

Source:
http://www.qubit.com/sites/default/files/pdf/mostwinningabtestresultsareillusory_0.pdf

If tests are shut down too early, the results displayed frequently reflect only a momentary snapshot. The true outcome, like an unforeseeable twist in a movie, cannot be predicted at that point.

With this article, I would like to give you the tools to avoid shutting down your test too early and to make certain that it belongs to the one-third of tests that provide valid results.

How can I also verify uplifts for my variations with certainty?

Let’s imagine that we’d like to find out whether there is a difference between men’s and women’s shoe sizes and to prove it using a test. If we draw only a few men and a few women for the sample, we run the risk of not discovering that men on average have larger shoe sizes than women. It could well be that, by chance, we have selected men who tend to have smaller feet and women with larger feet, and have come to a conclusion with a sample that is too small.

The larger the sample, the greater also the probability that the sample “stabilizes” and we can verify the actual difference between shoe sizes in men and women. A larger sample ensures that we obtain a reliable image of the actual shoe sizes for men and women and that we also find the real, existing difference through the test.
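The effect of sample size can be illustrated with a small simulation. This is only a sketch: the mean shoe sizes (41.0 and 39.5), the standard deviation (2.0), and the function name `detection_rate` are assumptions made up for illustration, and the significance check is a rough normal approximation rather than an exact test.

```python
import random
from statistics import mean, stdev


def detection_rate(n_per_group, trials=1000, seed=1):
    """Share of simulated studies that detect the (assumed) true difference."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Hypothetical populations: the means and spread are illustrative only.
        men = [rng.gauss(41.0, 2.0) for _ in range(n_per_group)]
        women = [rng.gauss(39.5, 2.0) for _ in range(n_per_group)]
        # Standard error of the difference between the two sample means.
        se = ((stdev(men) ** 2 + stdev(women) ** 2) / n_per_group) ** 0.5
        z = (mean(men) - mean(women)) / se
        if z > 1.96:  # rough normal-approximation threshold
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (5, 20, 50):
        print(n, detection_rate(n))
```

Under these assumptions, the share of simulated studies that find the difference grows as the number of people per group grows, even though the true difference never changes.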

The sample in an online business

In online business we are naturally interested in whether our created variation is better than the previous variation (control variation).

Example: Uplift with coupons
Let us now simply assume that users in one variation get a 10€ discount on their current order, which is not available in the control variation.

The hypothesis:
As a result of the additional monetary incentive of 10€, the motivation to carry out an order is increased, whereby the number of orders in turn also increases.

The result:
After a 30-day test runtime the testing tool reports a significant uplift of 3%. Had we let the test run for only 14 days, we would have measured an uplift of just 1%, which would not have been significant.

The insight: Longer = more validity! But how long, exactly?
In principle, it can be said that the longer a test runs, the higher the probability of verifying a true difference with the test. In our shoe example, that means verifying that men have a larger shoe size than women.

If this effect, as with our illustration of shoe sizes, is very clear, then even a smaller sample will produce valid results. If, however, the effect is very slight, as with the example of the website with an uplift of 3%, then the sample must be many times larger in order to verify the effect with a particular degree of certainty.
If our sample is too small, we take the risk that we will not discover the existing difference, even though there is one. How large the sample must be at a minimum in order to also be able to verify the effect can be determined with the help of the statistical power calculation:

The statistical power calculation

The statistical power is the probability of being able to verify an uplift that also actually exists by means of an experiment.

The greater the power, the more likely it is to be able to significantly verify an uplift that actually exists by means of an A/B test.

As a rule, we call a test powerful if it has a power of at least 80%. This means that the probability of verifying an uplift that actually exists is 80%. Conversely, there is still a 20% risk that we will not verify an uplift that actually does exist. Perhaps you have already heard of this as the “beta error” or a “false negative” (type II error).

It is as if Christopher Columbus had a 20% risk of sailing past America and therefore not discovering a new land that was actually there.
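To make the notion of power more concrete, here is a minimal sketch that approximates the power of a two-sided test comparing two conversion rates. It uses a standard normal approximation; the function name `approx_power` and its parameters are assumptions for illustration, and the uplift is interpreted as a relative uplift on the control conversion rate.

```python
from statistics import NormalDist


def approx_power(p_control, relative_uplift, n_per_variation, confidence=0.95):
    """Approximate power of a two-sided two-proportion z-test."""
    p_variation = p_control * (1 + relative_uplift)
    # Standard error of the difference between the two observed rates.
    se = (p_control * (1 - p_control) / n_per_variation
          + p_variation * (1 - p_variation) / n_per_variation) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Probability that the observed difference clears the significance bar.
    return NormalDist().cdf(abs(p_variation - p_control) / se - z_crit)


if __name__ == "__main__":
    for n in (2_500, 7_500, 50_000):
        print(n, round(approx_power(0.17, 0.03, n), 3))
```

With a small uplift and only a few thousand visitors per variation, the approximated power stays far below the 80% mark; it only approaches that mark once the sample becomes much larger.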

If the power of the experiment is too low, however, we not only run the danger of not verifying real uplifts. Even worse, we may shut down an experiment because it shows a significant winner that, in reality, is no winner at all. In that case we speak of an “alpha error” or a “false positive” (type I error).

In this case Columbus would have thought he landed in India, although he discovered a new continent.

If we shut down an experiment too soon, that is, as soon as the testing tool shows a significant uplift, the error rate can reach 77%. This figure comes from A/A tests: in 77% of them, two identical pages showed a “significant” difference at some point during the test, purely by chance.

Ton Wesseling: You should know that stopping a test once it’s significant is deadly sin number 1 in A/B-testing land. 77% of the A/A-tests (same page against same page) will reach significance at a certain point (Source).
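The mechanism behind this warning can be reproduced with a small A/A simulation. The sketch below is not the analysis behind the quoted 77%; the conversion rate, the number of peeks, the traffic per peek, and all function names are assumptions chosen for illustration, and the exact percentages depend heavily on how often one looks at the results.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at 95% confidence


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled == 0 or pooled == 1:
        return False  # no variance yet, nothing to test
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return abs(conv_a - conv_b) / n / se > Z_CRIT


def aa_false_positive_rates(rate=0.17, peeks=20, visitors_per_peek=100,
                            simulations=300, seed=7):
    """Return (share significant at any peek, share significant at the end)."""
    rng = random.Random(seed)
    ever_significant = 0
    significant_at_end = 0
    for _sim in range(simulations):
        conv_a = conv_b = n = 0
        ever = last = False
        for _peek in range(peeks):
            # Both arms share the same true conversion rate (A/A test).
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_peek))
            n += visitors_per_peek
            last = is_significant(conv_a, conv_b, n)
            ever = ever or last
        ever_significant += ever
        significant_at_end += last
    return ever_significant / simulations, significant_at_end / simulations


if __name__ == "__main__":
    peeking, fixed = aa_false_positive_rates()
    print(f"significant at some peek: {peeking:.0%}, at planned end: {fixed:.0%}")
```

The share of A/A tests that look significant at some peek is always at least as high as the share that are significant at the planned end, and it keeps growing the more often the results are checked.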

Perhaps you are familiar with the following situation from your everyday optimizer world:

The test showed an uplift of 10% in orders. The results are significant after ten days (confidence > 95%) and the test is shut down. Exactly the type of result that you can only hope for.

In order not to lose time, the test is shut down immediately, the concept is quickly implemented by the IT department and the original is replaced with the new one. Now, all you have to do is wait until the uplift also translates into the numbers. 🙂

But what if the 10% uplift simply doesn’t appear under real conditions and the numbers show a very different picture? It can happen that the uplift is not to be seen at all! In principle, the numbers remain exactly as they were previously, that is, unchanged.

This can be due to the fact that the test has simply been shut down too early. If the test had been allowed to run somewhat longer, it would have become clear that the conversion rates of both variations converged again and that the significance determined earlier had disappeared. This effect is clarified by the following graphics:

What happened in the test?

If we had carried out a power calculation before the test, we would have discovered that our test had to run for one month before we reached a power of 80%. However, the test was shut down after only ten days because the result was significant. In your defense, the testing tool had already declared a winner, which is why you shut down the test in good faith. The crux of the matter: the statistical power at that point in time was just 20%, and the testing tool did not show this.

Calculation for the minimal test runtime for valid results

In order to discover how long a test has to run at a minimum to obtain valid results, we must carry out a calculation considering the statistical power even before the start of the test:

These factors determine the minimal test runtime:

Conversions / month – this is the metric that you would like to optimize. As a rule, it is the orders. For exact planning, you should, whenever possible, consider only the conversions from users who also visited the test page beforehand.

For example, let’s carry out a test on the product detail page. Within one month 3,000 users have placed an order. Of these 3,000 users, however, only 2,600 were previously on the product detail page. Four hundred orders were placed either directly from the shopping cart or via the category page. In this case, the conversions / month = 2,600.

The conversion rate for the test page – this is the ratio of the users who visited the test page and purchased to all users who visited the test page.

Example: The product detail page was visited by, in total, 15,000 users within the month. This means that the conversion rate for the test page = 2,600 conversions / 15,000 visitors on the product detail page (17.3%).

The number of variations for the test – this information is important because the runtime is extended depending on the number of variations. The more variations included in the test, the longer the test runtime.

Confidence – How certain would you like to be about your test results? Confidence indicates to what extent you are prepared to take the risk of chance effects. In this regard, you should select the confidence that you otherwise also use to interpret your test results. As a rule, this is 95%. This means that the risk of finding a random effect that, in reality, does not exist is only 5%.

Power – Power expresses the probability with which you can verify an actual, existing uplift by means of an experiment. The value of 80% is often the goal.

Expected uplift – This is the effect on the conversion rate that we expect from the test concept. Here, you should enter a value that is realistic based on your previous testing experience for such a test. The closer your estimate is to the actual value, the better you will estimate the power of the experiment and obtain valid data. If you cannot estimate the outcome of a test at all, I recommend entering the minimum uplift you would need in order to call the test a success and roll the variation out afterwards.
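The factors listed above can be combined into a rough runtime estimate. The following is a minimal sketch based on a standard normal-approximation sample size formula for comparing two proportions; it is not the calculation used by any particular tool. The function name `min_runtime`, the 30-day month, the even traffic split across all arms (control plus variations), and the 10% relative uplift in the usage example are assumptions for illustration, and no correction for multiple comparisons is applied.

```python
from math import ceil
from statistics import NormalDist


def min_runtime(conversions_per_month, visitors_per_month, arms,
                expected_uplift, confidence=0.95, power=0.80):
    """Return (visitors needed per arm, approximate runtime in days)."""
    p1 = conversions_per_month / visitors_per_month  # test page conversion rate
    p2 = p1 * (1 + expected_uplift)                  # rate we hope to reach
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    n_per_arm = ceil((z_alpha + z_beta) ** 2
                     * (p1 * (1 - p1) + p2 * (1 - p2))
                     / (p2 - p1) ** 2)
    # Traffic is assumed to be split evenly across all arms, 30 days per month.
    days = 30 * n_per_arm * arms / visitors_per_month
    return n_per_arm, days


if __name__ == "__main__":
    # Figures from the product detail page example; the uplift is hypothetical.
    n, days = min_runtime(2_600, 15_000, arms=2, expected_uplift=0.10)
    print(f"{n} visitors per arm, about {days:.0f} days")
```

With the example figures (2,600 conversions from 15,000 visitors) and an assumed 10% relative uplift, this sketch lands at roughly one month for a control plus one variation. Adding variations or expecting a smaller uplift lengthens the runtime accordingly.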

Tools for calculating power

Because testing tools, as a rule, unfortunately do not carry out power calculations or offer them to their users, it falls to you to do this yourself. There are a number of tools available on the web to calculate statistical power. One of the best known is G*Power.

These tools are usually somewhat difficult to interpret and are not well suited to everyday CRO work. Fortunately for us optimizers, there are also tools for calculating test runtimes that are clearly better tuned to our CRO process.

It can even work more elegantly: Iridion, developed by konversionsKRAFT itself, lets optimizers approach the power calculation from two perspectives. In addition to the “normal” test runtime calculation, you can also analyze whether it is even possible to achieve a valid result in a predetermined test runtime.

After having input the appropriate information for the experiment, an individual recommendation is given on how long the experiment must at least run, or whether you have to reduce the number of variations or extend the test runtime in order to obtain valid results.

Successful test planning in five steps

Define how much uplift your variation must achieve at a minimum so that you can call the test a success, that is, so that you would also want to build the variation and roll it out.
Determine how certain you wish to be of really discovering an actual uplift (power). In other words, how much risk are you prepared to accept of missing an uplift that exists in reality?
Plan a realistic number of variations for your experiment. The less traffic you have available, the fewer variations you should use in your experiment.
Calculate the minimal test runtime and the minimal number of conversions and visitors for your test and keep this firmly in view.
Do not shut the test down until you have fulfilled the minimal test requirements.

Conclusion:

Would you like your test to belong to the one-third of tests that provide valid results? By thinking your plan through before testing, you rule out the possibility that your experiment will fail to provide valid results because its runtime was too short.

The validity of the test result is assured through the help of a test runtime calculation. How certain this should be exactly, of course, depends on individual judgment. The level of the desired power and the accepted error probability can be established individually per test.

Find a good middle path between validity and business interests.

Of course, a high level of validity is associated, as a rule, with longer test runtimes and thus higher costs. Ultimately, everyone has to decide for him- or herself at what point A/B testing is most profitable, that is, how much residual risk of error is acceptable in order to save time and money. This small window of maximum profitability from A/B testing is described by the U-model of validity.
