What does the perfect test runtime look like? Is there even such a thing? What factors ultimately influence a test and its runtime? Can the runtime be predicted in advance at all? When should I stop a test? To spoil the suspense right away: there is no perfect runtime, at least not in practice. The list of possible influencing factors is long. If the essential (known) aspects are broken down, however, three groups emerge:
  1. Statistical influencing factors (e.g. sample size, number of variations, measurement method, error tolerance, etc.)
  2. External influencing factors (e.g. traffic source, time-to-purchase, frequency of usage, campaigns, season, etc.)
  3. Economic influencing factors (e.g. budgets, positive/negative effects, cost-benefit ratio, etc.)
These three aspects significantly determine how long a test should run. So far, so good. Each deserves a closer look so that you can derive your own strategy from it. The easiest to describe is the statistical test power: it rests on clear mathematics and is measurable and predictable.

Aspect 1: the statistical test power

A test duration calculator is usually used in this context to predict a runtime. To avoid getting lost in the dry basics of statistics: in principle, this is a mathematically rearranged form of the test power calculation. The test power indicates how probable it is to detect a difference between a null hypothesis (“there is no difference”) and a specific alternative hypothesis (“there is a difference”). The greater the test power, the more likely an actual effect (uplift) will be detected if one exists. If this uplift is very large, a short test runtime is enough for good test power. If, however, the effect is small, correspondingly (much) more time is required. To calculate the power, three things are needed:
  • The α-error (significance level, usually 5%, i.e. a 95% confidence level),
  • The sample size, and
  • The expected uplift (effect).
The calculator, in other words, does nothing more than rearrange this formula. The sample size is calculated from the existing conversion rate, the expected effect, and the test power. As a rule, the latter is set at 80% (many tools let you choose the value). Once the required sample size per variation is known for the desired uplift, the runtime can be calculated quite simply from the number of variations and the visitors per day. At this point the attentive reader will already have spotted the crux of the calculation: the most important influencing factor is the expected uplift (effect), and it is precisely this value that is speculative!

#1 intermediate conclusion: From a statistical point of view, the sample size, and with it the test runtime, stands or falls with the expected uplift. The higher the expected uplift, the smaller the required sample.

Okay, but why is this only half the truth? As mentioned at the start, three aspects essentially influence the test runtime. The second, and clearly less tangible, one is the external influencing factors.
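Before turning to those external factors, here is a minimal sketch of the calculation just described, assuming a two-proportion z-test (normal approximation); the 3% baseline conversion rate, the 10% expected relative uplift, and the 5,000 visitors per day are hypothetical numbers, not values from any particular tool:

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline_cr: float, uplift: float,
                              alpha: float = 0.05, power: float = 0.80,
                              two_tailed: bool = True) -> int:
    """Visitors needed per variation to detect a relative uplift."""
    p1 = baseline_cr                     # control conversion rate
    p2 = baseline_cr * (1 + uplift)      # variation conversion rate
    tail = alpha / 2 if two_tailed else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pooled * (1 - pooled))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# 3% baseline conversion rate, hoping for a 10% relative uplift:
n = sample_size_per_variation(0.03, 0.10)   # ~53,000 visitors per variation
days = 2 * n / 5000                          # 2 variations, 5,000 visitors/day
print(f"{n:,} visitors per variation, about {days:.0f} days of runtime")
```

Notice how sensitive the result is: double the expected uplift to 20% and the required sample shrinks to roughly a quarter, which is exactly why the speculative uplift value dominates the whole calculation.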

Aspect 2: What are all of the influences on a test?

In other words: if the sample is large enough, i.e. if (very) much traffic is available, then from a statistical point of view a test does not need to run particularly long. This is especially true when few variations are tested and the expected effect (also called impact) is high. The problem is that, in the online world, you have no influence on who takes part in the test. The selected sample is therefore not always representative. There are numerous examples in which variations perform particularly well or poorly during the initial days but then change (completely) in due course, despite a large sample at the start. Three aspects with a determinative effect are differing behavioral patterns, types of usage, and environmental variables. A whole series of factors can thus have an influence, such as:
  • The day of the week (weekend vs. weekday),
  • Situation (workplace vs. couch vs. commuting),
  • Mood (good weather vs. bad),
  • Prior knowledge (existing customers vs. new),
  • Motivation (promotion vs. recommendation),
  • and many others.
In other words, participants in a test during the week can behave entirely differently from users on weekends. Especially for tests that are not set up for a specific channel, the entire traffic mix must be suitable; otherwise, in case of doubt, the momentary test record captures everything but the norm. A test should prove robust against changes in traffic.

A further behavioral component is the time-to-purchase. This period differs depending on the business model, the industry, and the products. For higher-priced or consultation-intensive products, for example, a user may enter the test several times over a number of weeks before triggering a conversion. If the test runtime is chosen too short, the conversion may fall outside of the test.

There is also a technical sticking point: so-called traffic pollution. If the purchase cycle demands that the test run longer, or if this is necessary for the statistical reasons already mentioned (power, sample size), the longer runtime itself can distort the results. According to Ton Wesseling, CEO of Testing Agency, around 10% of participants delete their cookies within two weeks, for example, and thereby lose their assigned test variation. A change of end-user device (different desktop PCs, tablets, or mobiles) likewise leads to variation loss. The longer a test runs, the higher the risk that participants blend across variations and a valid result is delayed.

The type of conversion goal also plays a role, particularly for online shops with heterogeneous product assortments (a bolt vs. heavy-load shelving). For non-binomial metrics (e.g. revenue), extreme values (such as particularly large orders) influence the runtime considerably if they are not filtered out (whether this happens depends on the tool); a minimal sketch of such a filter follows below.

#2 intermediate conclusion: The most varied external factors, not all of which can be predicted, influence a test. For a representative sample, a test must run long enough to capture various behavioral patterns and types of usage.

The final influencing factor is an economic one. Unlike the aspects already presented, it has no direct effect on the runtime, but it is relevant when deciding on a strategy.
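Before turning to the economics, here is that sketch of an extreme-value filter, assuming a simple percentile cap; the 99th-percentile cutoff and the sample figures are assumptions for illustration, not tool defaults:

```python
import statistics

def cap_outliers(order_values: list[float], percentile: float = 0.99) -> list[float]:
    """Cap extreme order values so a single bulk order cannot decide the test."""
    cutoff = sorted(order_values)[int(len(order_values) * percentile) - 1]
    return [min(v, cutoff) for v in order_values]

control = [0.0, 0.0, 12.5, 35.0, 49.9, 7800.0]   # one extreme bulk order
print(statistics.mean(control))                   # ~1316: dominated by the outlier
print(statistics.mean(cap_outliers(control)))     # ~24.6: a far more robust comparison
```

Without such a filter, a single heavy-load-shelving order can swing revenue per visitor for days, forcing a correspondingly longer runtime before the metric settles.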

Aspect 3: Stated with some exaggeration – how much will the truth cost me?

Finally, the economic aspect also determines how long a test should run when it shows a positive, a negative, or possibly no effect at all. Is it monetarily justifiable to keep a test running when one variation already shows a significant uplift?
The control burns cash, so to speak, doesn’t it?
Even worse:
The variation is significantly worse; wouldn’t it be better to shut it down to avoid burning cash?
One more classic:
The variations converge. Does it make sense to let the test continue to run?
Although the answers to these questions can be derived from statistics and the external influencing factors, it always remains a question of judgment: where should the focus lie? On the fastest possible test result that is just adequate for a valid decision:
I want to know whether the variation is better
Or does the result need to be as accurate as possible and as safe from statistical coincidence as it can be:
I want to know by exactly how much the variation is better
This can even be determined by calculation in the preliminary stage: at 80% test power and one-tailed testing, the test clearly reaches a result more quickly, which is exactly why test tools in the lower price range use this combination. At 95% test power and two-tailed testing, it takes correspondingly longer but is, in turn, more precise. The sketch below shows the influence of the test power and the type of testing on the sample size (and thus on the test runtime). Even when merely calculating the test runtime, economic aspects can tip the scales: given the expected uplift and the available traffic, the test may have to run so long that it becomes uneconomical to carry it out at all.
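The following sketch reuses the sample_size_per_variation function from aspect 1 (same hypothetical 3% baseline and 10% expected relative uplift) to make the trade-off concrete:

```python
# Cheaper/faster vs. slower/more precise test setups:
for power, two_tailed in [(0.80, False), (0.80, True), (0.95, True)]:
    n = sample_size_per_variation(0.03, 0.10, power=power, two_tailed=two_tailed)
    tails = "two-tailed" if two_tailed else "one-tailed"
    print(f"power {power:.0%}, {tails}: {n:,} visitors per variation")

# power 80%, one-tailed: ~42,000 visitors  -> fastest, least precise
# power 80%, two-tailed: ~53,000 visitors
# power 95%, two-tailed: ~88,000 visitors  -> slowest, most precise
```

Roughly doubling the sample buys the move from the fast one-tailed 80% setup to the precise two-tailed 95% setup; whether that extra runtime is worth it is exactly the economic judgment call described above.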

Then what is the right strategy for the test runtime?

Once you are aware of all of these influencing factors, you should create an appropriate test runtime formula or stop-test rule. Unfortunately, there is no panacea for this. Apart from the “hard” statistical factors, everything else depends on the respective test situation. In the end, the test hypothesis may simply not work or may lack adequate contrast. With a representative sample as the guiding goal, however, several indicators can help you build your own formula:
  1. To capture as many different usage cycles and types as possible, the test should run over entire weeks (e.g. Mon. to Mon.).
  2. As many different users as possible should take part in the test, e.g. existing customers and new customers (various traffic sources, campaigns such as newsletters, TV spots, etc.). Segmentation is important here in order to interpret the results and gain deeper insights.
  3. The test should run for a longer period of time even if high traffic is available. Under certain circumstances, the traffic can be reduced (traffic allocation) in order to test for a longer period and reduce costs for traffic volume.
  4. Don’t stop the test “too early.” The statistical and external influences described above require test time. This is equally true for positive and negative results! Just because the tool displays 95% significance or higher after a short time does not mean you can immediately stop and celebrate.
  5. Consider the time-to-purchase and possible adaptation effects (existing customers). The test should include at least one full purchase cycle.
 

Conclusion

This article should not leave the impression that the test runtime is arbitrary and cannot be planned; quite the contrary. If you are aware of the aspects presented here, you can estimate in advance how long a test should run. The more experience you collect while testing, the more accurately the “expected uplift” needed for the calculation and the additional external factors can be estimated. My rule of thumb for an average test runtime and a stop-test strategy:
A test should run at least two to four weeks and include at least 1,000 to 2,000 conversions per variation. Typical activities (newsletters, TV spots, sales, etc.) should take place during this time. As many different channels as possible should be represented (the total traffic mix, either through targeting in advance or segmentation afterwards). If the test has reached a minimum statistical significance of 95% (two-tailed, i.e. detecting both positive and negative effects) and the result is stable, stop the test.
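A minimal sketch of this stop rule, with the thresholds taken from the rule of thumb above; the “stable for at least three days” check is my assumption for what “stable” means in code:

```python
def should_stop(days_running: int, conversions_per_variation: list[int],
                significance: float, stable_days: int) -> bool:
    """Stop only when runtime, sample size, and stable significance all hold."""
    return (days_running >= 14                          # at least two full weeks
            and min(conversions_per_variation) >= 1000  # per variation
            and significance >= 0.95                    # two-tailed significance
            and stable_days >= 3)                       # assumed: stable 3+ days

print(should_stop(21, [1450, 1380], 0.97, stable_days=5))  # True: safe to stop
print(should_stop(6, [400, 520], 0.99, stable_days=1))     # False: too early, despite 99%
```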
What have you experienced regarding test runtime or a stop-test strategy? I would appreciate your feedback!

