What does a perfect test runtime look like? Is there even such a thing as a perfect runtime? Which factors ultimately influence a test and its runtime? Can the runtime even be predicted in advance? When should I stop a test? To give away the answer right up front: there is no perfect runtime, at least not in practice. The list of possible influencing factors is long. If the essential (known) aspects are broken down, however, three groups emerge:
- Statistical influencing factors (e.g. sample size, number of variations, measuring method, error tolerance, etc.)
- External influencing factors (e.g. traffic source, time-to-purchase, frequency of usage, campaigns, season etc.)
- Economic influencing factors (e.g. budgets, positive/negative effects, costs-benefit relationship, etc.).
Aspect 1: The statistical test power

A traffic duration calculator is usually used in this context to predict a runtime. To avoid getting lost in dry statistical basics: in principle, this is a mathematically rearranged form of the test power calculation. The test power indicates how likely a test is to detect an actual difference, i.e. to reject the null hypothesis ("there is no difference") in favor of a specific alternative hypothesis ("there is a difference"). A high test power does not mean a difference exists; it means that an actual effect (uplift), if one is present, is likely to be detected. If this uplift is very large, a short test runtime is enough for good test power. If the effect is small, correspondingly (much) more time is required. To calculate the power, three things are needed:
- The α-error (significance level; usually a 95% confidence level, i.e. α = 5%),
- The sample size, and
- The expected uplift (effect).
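These three inputs can be turned into a rough duration calculation. The sketch below uses the standard two-proportion normal approximation for the required sample size per variation; the function name, signature, and default values are illustrative assumptions, not the formula of any particular tool.

```python
import math
from statistics import NormalDist  # Python 3.8+ standard library


def required_sample_size(p_base, uplift, alpha=0.05, power=0.80, two_tailed=True):
    """Rough number of visitors needed *per variation* to detect a relative uplift.

    Two-proportion normal approximation; real test tools may differ slightly.
    """
    p_var = p_base * (1 + uplift)  # conversion rate of the variation
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2 if two_tailed else 1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    n = variance * (z_alpha + z_power) ** 2 / (p_var - p_base) ** 2
    return math.ceil(n)


# A small effect needs far more traffic than a large one (3% baseline conversion rate):
n_small_effect = required_sample_size(0.03, 0.05)  # 5% relative uplift
n_large_effect = required_sample_size(0.03, 0.20)  # 20% relative uplift
```

Dividing the result by the daily traffic per variation then gives a first runtime estimate, which is essentially what a traffic duration calculator does.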
Aspect 2: What influences a test?

Is the sample large enough? In other words: if (very) much traffic is available, then, from a statistical point of view, a test does not need to run particularly long. This is especially true when few variations are tested and the expected effect (also called impact) is high. The problem is that, in the online world, you have no influence on who takes part in the test. The selected sample is therefore not always representative. There are numerous examples where variations run particularly well or poorly during the initial days, but then change (completely) in due course, despite a large sample at the start. Three aspects that play a decisive role are differing behavioral patterns, type of usage, and environmental variables. A whole series of factors can thus have an influence, such as:
- The day of the week (weekend vs. weekday),
- Situation (work place vs. couch vs. commuting),
- Mood (good weather vs. bad),
- Prior knowledge (existing customers vs. new),
- Motivation (promotion versus recommendation),
- and many others
Aspect 3: Stated with some exaggeration – how much will the truth cost me?

The economic aspect also determines how long a test should run when the effect is positive, negative, or absent. Is it monetarily justifiable to let a test continue running when a variation already shows a significant uplift?
The control burns cash, so to speak, doesn't it?

Even worse:
The variation is significantly worse; wouldn't it be better to shut it down to avoid burning cash?

Another classic:
The variations converge. Does it make sense to let the test continue to run?

Although answers to these questions can be derived from statistics and from the external influencing factors, this always remains a question of judgment. Where should the focus lie? On the fastest possible test result that is just adequate to reach a valid decision:
I want to know whether the variation is better.

Or does the result need to be as accurate as possible and as safe from statistical coincidence as it can be:
I want to know by exactly how much the variation is better.

This can be calculated even in the preliminary stage: at 80% test power and one-tailed testing, the test will clearly reach a result more quickly, which is exactly why test tools in the lower price range use this combination. At 95% test power and two-tailed testing, it takes correspondingly longer but is, in turn, more precise. An example calculation shows the influence of test power and one- vs. two-tailed testing on the sample size (and thus on the test runtime). Even just calculating the test runtime lets economic aspects tip the scales: given the expected uplift and available traffic, the test may need to run so long that it becomes uneconomical to carry it out at all.
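The difference between these two common configurations can be made concrete. The sketch below uses assumed numbers (a 3% baseline conversion rate, a 10% expected relative uplift, 2,000 visitors per day per variation) and the standard two-proportion normal approximation; none of these figures come from a specific tool.

```python
import math
from statistics import NormalDist


def sample_size(p_base, uplift, alpha, power, two_tailed):
    """Visitors needed per variation (two-proportion normal approximation)."""
    p_var = p_base * (1 + uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2 if two_tailed else 1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return math.ceil(variance * (z_alpha + z_power) ** 2 / (p_var - p_base) ** 2)


# "Fast" setup: 80% power, one-tailed
n_fast = sample_size(0.03, 0.10, alpha=0.05, power=0.80, two_tailed=False)
# "Precise" setup: 95% power, two-tailed
n_precise = sample_size(0.03, 0.10, alpha=0.05, power=0.95, two_tailed=True)

# Translate sample size into runtime (assumed 2,000 visitors/day/variation)
days_fast = math.ceil(n_fast / 2000)
days_precise = math.ceil(n_precise / 2000)
```

Under these assumptions the precise setup needs roughly twice the sample of the fast setup, which is exactly the trade-off between speed and precision (and test cost) described above.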
What, then, is the right strategy for the test runtime?

Once you are aware of all of these influencing factors, you should create an appropriate test runtime formula or stop-test rule of your own. Unfortunately, there is no panacea. Aside from the "hard" statistical factors, everything else depends on the respective test situation. In the end, the test hypothesis may simply not work or may not offer enough contrast. Still, with a representative sample as the guiding principle, several indicators can help in building your own formula:
- In order to be able to display as many different usage cycles and types as possible in the test, it should run over entire weeks (e.g. Mon. to Mon.).
- As many as possible various users should take part in the test, e.g. existing customers and new customers (various traffic sources, campaigns such as newsletters, TV spots, etc.). Segmentation here is important in order to interpret the results and to be able to gain deeper insight.
- The test should run a longer period of time, even if high traffic is available. The traffic can be reduced under certain circumstances (traffic allocation), in order to be able to test for a longer period of time (reduce costs for traffic volumes).
- Don’t stop the test “too early.” The statistical and external influences described above require test time. This is equally true for positive and negative results! Just because the tool displays 95% significance or higher after a short time does not mean you can immediately stop and celebrate.
- Consider the time to purchase and possible adaptation effects (existing customers). The test should cover at least one full purchase cycle.
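These indicators can be bundled into a simple stop-test checklist. The thresholds below follow the rule of thumb from the conclusion (full weeks with at least two weeks of runtime, at least 1,000 conversions per variation, a stable 95% two-tailed significance); the function itself is only an illustrative sketch, not a statistical guarantee.

```python
def may_stop_test(days_running: int,
                  conversions_per_variation: int,
                  significance: float,
                  stable: bool) -> bool:
    """Checklist-style stop rule built from the article's rule of thumb."""
    # Full weeks only (e.g. Mon. to Mon.), and at least two of them
    full_weeks = days_running >= 14 and days_running % 7 == 0
    enough_data = conversions_per_variation >= 1000
    significant = significance >= 0.95  # two-tailed
    return full_weeks and enough_data and significant and stable


# 95%+ after only five days is not a reason to stop:
may_stop_test(5, 300, 0.96, True)    # -> False
# Three full weeks, enough conversions, stable significance:
may_stop_test(21, 1500, 0.95, True)  # -> True
```

The point of such a checklist is that significance alone never triggers a stop; runtime, sample size, and stability all have to hold at the same time.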
Conclusion

This article should not leave the impression that the test runtime is arbitrary and cannot be planned – on the contrary. Once you are aware of the aspects presented here, you can estimate how long a test should run, even in the preliminary stage. The more experience you collect while testing, the better you can estimate the “expected uplift” value needed for the calculation, as well as the additional external factors. My rule of thumb for an average test runtime and a stop-test strategy:
A test should run at least two to four weeks and include at least 1,000 to 2,000 conversions per variation. Typical activities (newsletters, TV spots, sales, etc.) should take place during this time, and as many different channels as possible should be represented (the full traffic mix, either through targeting in the preliminary stage or segmentation afterwards). If the test has reached a minimum statistical significance of 95% (two-tailed, i.e. positive and negative) and the result is stable, stop the test.

What have you experienced regarding test runtime or a stop-test strategy? I would appreciate your feedback!

Links
- How long to run a test – Optimizely
- How large should your A/B test sample size be? – VWO
- Statistical Significance Does Not Equal Validity – ConversionXL