What is the simulated Type II error rate, and therefore the power, of Stats Engine?
In our general blog post, we give evidence that Stats Engine still has Type II error comparable to fixed-horizon testing. Power is defined as 1 minus Prob(Type II error), and we show in the post that Stats Engine finds powered results - Power > 80%, or Type II error < 20% - and passes on under-powered results for historical Optimizely experiments.
While we felt this was the most important statistic to share in the general blog post - the performance of Stats Engine on real experiments - this is not quite the same as a simulated Type II error analysis.
We felt that the blog post alone didn’t give the whole picture of Stats Engine’s performance, so we ran some simulations. Read on for the results!
I set up the simulation by assuming tests run for 1000 visitors in both variation and baseline. Then I simulated A/B tests from a range of effect sizes. The range had three levels, corresponding to the effect size needed to achieve 40, 60, and 80% Power - the chance that the non-zero effect is actually detected - when running a standard, fixed horizon t-test to 1000 visitors (only looking once, at the very end). I then simulated 1000 experiments for each effect size level using Stats Engine, and recorded the percentage found significant (the empirical power).
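For a concrete picture of that setup, here is a minimal Python sketch of the fixed-horizon half only. It assumes details the description above leaves open: a two-sided z-test at a 0.05 significance level, normally distributed outcomes with unit variance, and 1000 visitors in each arm. The function names and the seed are made up for illustration, and the Stats Engine half is not reproduced here.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def effect_for_power(target_power, n_per_arm, alpha=0.05, sigma=1.0):
    """Mean difference at which a two-sided z-test reaches target_power.

    Uses the usual approximation that ignores the negligible chance of
    rejecting in the wrong direction.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(target_power)
    return (z_alpha + z_power) * sigma * sqrt(2 / n_per_arm)


def empirical_fixed_power(effect, n_per_arm, n_experiments=1000,
                          alpha=0.05, sigma=1.0, seed=0):
    """Fraction of simulated experiments a fixed-horizon z-test calls significant."""
    rng = random.Random(seed)
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two arm means.
    se = sigma * sqrt(2 / n_per_arm)
    hits = 0
    for _ in range(n_experiments):
        # Draw the observed difference in means directly; it is N(effect, se^2).
        observed_diff = rng.gauss(effect, se)
        if abs(observed_diff / se) > crit:
            hits += 1
    return hits / n_experiments


if __name__ == "__main__":
    for target in (0.4, 0.6, 0.8):
        effect = effect_for_power(target, n_per_arm=1000)
        print(target, round(effect, 4), empirical_fixed_power(effect, 1000))
```

Drawing the difference in means directly, rather than simulating every visitor, is a shortcut that only works because the test looks once at the end; a sequential test needs the full visitor-by-visitor path.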
The figures below show the results.
First off, running both Stats Engine (Sequential) and a t-test (Fixed) to 1000 visitors shows that the t-test has more power, or lower Type II error, across the board. This is represented as the difference between the grey and orange bars of the first figure, and could be considered the cost of Stats Engine.
Yet this is misleading, and isn’t really a cost. Think about how you would have to achieve a target power in the fixed horizon world. First, find out the minimum detectable effect (MDE) of your test, then plug it into a sample size calculator to find out how long to wait to achieve, say, 80% Power, and finally wait to read off results because that’s what the sample size calculator told you. If the effect size of your A/B Test turns out to be lower or higher than your MDE, you’ll also achieve lower or higher Power. Only if you match your MDE to your effect size exactly do you get your target Power.
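The fixed-horizon workflow above can be sketched in a few lines of Python. This is an illustration under assumptions not stated in the post: a two-sided z-test at a 0.05 significance level on unit-variance normal outcomes, with effects measured as a difference in means. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sample_size_per_arm(mde, target_power=0.8, alpha=0.05, sigma=1.0):
    """Visitors per arm a fixed-horizon z-test needs to detect `mde`."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(target_power)
    return ceil(2 * (sigma * (z_alpha + z_power) / mde) ** 2)


def realized_power(true_effect, n_per_arm, alpha=0.05, sigma=1.0):
    """Power actually achieved when the true effect differs from the MDE."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = true_effect / (sigma * sqrt(2 / n_per_arm))
    # Both rejection tails of the two-sided test.
    return STD_NORMAL.cdf(shift - crit) + STD_NORMAL.cdf(-shift - crit)


if __name__ == "__main__":
    n = sample_size_per_arm(mde=0.125)
    for true_effect in (0.0625, 0.125, 0.1875):
        print(n, true_effect, round(realized_power(true_effect, n), 3))
```

Holding the sample size fixed at what the calculator returned and varying only the true effect shows the point made above: the target power is hit only when the true effect equals the MDE.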
But an MDE is not an estimate of your effect size; it is a lower bound on the effects you want to detect, and it dictates how long you should run your tests. And picking the MDE can be really tough.
If you pick an MDE that is higher than the realized effect size of your test, you’ll likely see an inconclusive result. Your choices are then to move on, or restart the experiment from scratch with a higher sample size. This throws away the whole first experiment! On the other hand, if you pick an MDE that is lower than the realized effect size, you’ll be committing yourself to waiting longer than you need to.
Compare this to the same scenarios with Stats Engine. In the first case, if you are not seeing a significant result and believe this is due to a smaller effect size than you expected, you can simply wait longer to achieve significance. This is represented by the blue and green bars of the first figure. Waiting for 500 more visitors with Stats Engine now gives you roughly the same power at higher effect sizes, and waiting for 1,000 more visitors gives you comparable power at lower effect sizes as well as higher power at larger effect sizes.
In the second case - realizing a higher effect size than your MDE - you’ll stop sooner. This is what the bottom figure shows. If you run Stats Engine on tests with effect sizes that correspond to 80% t-test power, you’ll stop earlier than the sample size an 80% power calculator prescribes a little less than 60 percent of the time. As the effect size increases to the level corresponding to 90% power, you stop early about 75% of the time, and the chance of stopping early increases from there.
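Both scenarios, waiting longer and stopping early, can be explored with a small simulation. The Python sketch below is not Stats Engine: it uses a generic mixture sequential probability ratio test on unit-variance normal outcomes as a stand-in, and the mixing standard deviation `tau`, the significance level, and all function names are assumptions chosen for illustration. Its numbers should not be expected to match the figures.

```python
import random
from math import log, sqrt


def first_rejection_time(effect, max_visitors, alpha=0.05, tau=0.1,
                         sigma=1.0, rng=None):
    """Visitors per arm at which the sequential test first rejects, or None."""
    rng = rng or random.Random()
    threshold = log(1 / alpha)
    pair_var = 2 * sigma ** 2  # variance of one (variation - baseline) difference
    running_sum = 0.0
    for n in range(1, max_visitors + 1):
        running_sum += rng.gauss(effect, sqrt(pair_var))
        denom = pair_var + n * tau ** 2
        # Log likelihood ratio of a N(0, tau^2) mixture of effects vs. no effect.
        log_lr = (0.5 * log(pair_var / denom)
                  + tau ** 2 * running_sum ** 2 / (2 * pair_var * denom))
        if log_lr >= threshold:
            return n  # significance is checked after every visitor pair
    return None


def power_by_horizon(effect, horizons, n_experiments=1000, seed=0, **kwargs):
    """Share of simulated experiments that reject by each horizon."""
    rng = random.Random(seed)
    times = [first_rejection_time(effect, max(horizons), rng=rng, **kwargs)
             for _ in range(n_experiments)]
    return {h: sum(t is not None and t <= h for t in times) / n_experiments
            for h in horizons}


if __name__ == "__main__":
    # 0.1253 is roughly the effect giving a z-test 80% power at 1000 per arm.
    print(power_by_horizon(0.1253, [1000, 1500, 2000], n_experiments=200))
```

Because each experiment records when it first rejects, the same run answers both questions: the share rejecting by 1500 or 2000 visitors is the power gained by waiting, and the share rejecting before a fixed-horizon sample size is the chance of stopping early.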
To get a better feel for how these effect sizes translate to real Optimizely experiments, consider instead experiments that run to 10K visitors. A t-test now achieves 80% power for a relative effect size (aka improvement) of 10% on a baseline conversion rate of 0.1. This is roughly the sort of experiment an average Optimizely customer runs. In this situation, the difference between 80% and 90% power is only a 2% higher relative effect, and the difference between 80% and 99% power is a 7.5% higher relative effect.
These scenarios (checking your results, seeing insignificance, and waiting for more visitors for a more powerful test, or stopping early if you realize a larger effect size) are exactly the sort of use case that Stats Engine is built for, and where fixed-horizon testing falls short.
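To experiment with how relative improvement, baseline conversion rate, and sample size trade off, a normal-approximation power calculation for two conversion rates can be sketched as below. The post does not state the significance level or whether the test is one- or two-sided, and it does not say whether 10K visitors is per arm or in total, so those are left as parameters with assumed defaults; the output depends on them and may not reproduce the figures quoted above. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_proportion_power(baseline, relative_lift, n_per_arm,
                         alpha=0.05, two_sided=True):
    """Approximate power of a z-test comparing two conversion rates."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    # Unpooled standard error of the difference in conversion rates.
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    shift = (p2 - p1) / se
    if two_sided:
        crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
        return STD_NORMAL.cdf(shift - crit) + STD_NORMAL.cdf(-shift - crit)
    crit = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(shift - crit)


if __name__ == "__main__":
    # Assumed reading: 10K visitors per arm, 0.1 baseline, 10% relative lift.
    for alpha, two_sided in ((0.05, True), (0.10, True), (0.10, False)):
        print(alpha, two_sided,
              round(two_proportion_power(0.1, 0.10, 10_000, alpha, two_sided), 3))
```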
Finally, I’ll note that you can get much worse Power performance with sequential testing if you don’t pay attention to factors like the effect sizes you are likely to see. Tuning Stats Engine, as described in our technical write-up, has allowed us to maintain comparable power to fixed-horizon testing.
Hope this was helpful!
Statistician at Optimizely
Glad you liked the post! Our whitepaper on Stats Engine ( http://pages.optimizely.com/rs/optimizely/images/stats_engine_technical_paper.pdf ) has the formulas you would need to reproduce the simulations. Unfortunately we can't share the code itself as it uses our proprietary tuning methods.
Statistician at Optimizely