Today we are excited to share that Optimizely's new Stats Engine now powers results for all customers. Stats Engine gives you results that are always valid and accurately represent error rates, no matter when you look.
We've put together this FAQ to help you understand how Stats Engine affects your results in Optimizely. Starting today and running through Friday, you can also ask our statistics team anything over on this post in Optiverse. Our in-house statistician, Leo Pekelis, and product manager for statistics, Darwish Gani, will be answering questions from the community all week.
Finally, we also have some additional resources available for learning more about the new Stats Engine:
- Optimizely Blog: Learn why we made a new Stats Engine, and how it benefits your experiments
- Knowledge base: Learn how Optimizely calculates statistical significance for your results
- Technical paper: Get the math behind the new Stats Engine
Without further ado, here are the FAQs:
Why did you make a new Stats Engine for Optimizely?
Classical statistical methods like the t-test are not the best fit for online experimentation. The Internet makes it easy to look at your experiment results at any time and run tests with many goals and variations. When paired with classical statistics, these intuitive actions can increase the chance of incorrectly declaring a winning or losing variation by over 5x.
To get valid results from A/B tests run with classical statistics, careful experimenters follow a strict set of guidelines: Set a minimum detectable effect and sample size in advance, don't peek at results, and don’t test too many goals and variations at once. We heard from our customers that these guidelines are cumbersome and unintuitive, so we worked with statisticians from Stanford to create Stats Engine.
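To make the peeking problem concrete, here is a minimal simulation sketch. It runs A/A tests (both arms share the same true conversion rate, so every "significant" result is a false positive) and compares checking once at the end with checking at every peek. The two-proportion z-test, the 5% two-sided cutoff, the traffic numbers, and all function names are illustrative assumptions, not Optimizely's implementation.

```python
import math
import random

def z_score(conv_a, conv_b, n):
    """Two-proportion z-score for two arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se

def simulate_aa_tests(n_tests=400, n_visitors=1000, peek_every=50,
                      rate=0.10, z_crit=1.96, seed=7):
    """Return (false positive rate looking once, rate when peeking)."""
    rng = random.Random(seed)
    hits_final = 0
    hits_peeking = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        crossed = False
        z = 0.0
        for i in range(1, n_visitors + 1):
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
            if i % peek_every == 0:
                z = z_score(conv_a, conv_b, i)
                # A peeking experimenter stops at the first crossing.
                if abs(z) > z_crit:
                    crossed = True
        # n_visitors is a multiple of peek_every, so z is the final look.
        hits_final += int(abs(z) > z_crit)
        hits_peeking += int(crossed)
    return hits_final / n_tests, hits_peeking / n_tests

final_rate, peeking_rate = simulate_aa_tests()
print(f"false positive rate, one look at the end: {final_rate:.3f}")
print(f"false positive rate, stopping at any peek: {peeking_rate:.3f}")
```

The exact numbers depend on the seed and on how often you peek, but the second rate should come out well above the nominal 5%.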
How is Optimizely calculating statistical significance with Stats Engine?
- We no longer use a "fixed horizon" hypothesis framework, which evaluates significance at a set point in time based on a predetermined sample size. Instead, Stats Engine uses sequential testing, a framework in which significance is meant to be continuously evaluated and always valid.
- Under traditional fixed-horizon statistics, we used a 1-tailed t-test to calculate p-values on a standardized difference of means (also known as a “t-score”). With our move to sequential testing, we now use a 2-tailed likelihood ratio test that calculates p-values derived from the average likelihood ratio over time. Because the test is now 2-tailed, the default statistical significance level in Optimizely is now 90%.
- We are introducing control for the false discovery rate, a type of statistical error that results from testing many goals and variations at once (also known as the multiple testing problem). This means that Optimizely results now control error and display statistical significance according to the values users expect.
These new methods applied together mean that your experiment results will always be valid whenever you look, and all of the information you need to make statistically supported decisions is now available directly in Optimizely. Read the full story about why we made Stats Engine over on the Optimizely blog.
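As an illustration of what false discovery rate control does, the sketch below applies the Benjamini-Hochberg procedure, one standard FDR method, to a handful of made-up p-values. This FAQ does not spell out the exact procedure Stats Engine uses, so treat the choice of Benjamini-Hochberg, the function name, and the example p-values as assumptions.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return a list of booleans, True where the result is declared significant."""
    m = len(p_values)
    # Indices of the p-values, smallest p-value first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant

# Hypothetical p-values for five goal/variation combinations.
p_values = [0.001, 0.09, 0.08, 0.5, 0.9]
naive = [p <= 0.10 for p in p_values]
corrected = benjamini_hochberg(p_values, fdr=0.10)
print(naive)      # [True, True, True, False, False]
print(corrected)  # [True, False, False, False, False]
```

Judged one at a time against a 10% cutoff, three results look significant. Once the number of comparisons is taken into account, only the strongest one survives.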
What changes will I see in Optimizely as a result of Stats Engine?
You won't see many changes in Optimizely as a result of these under-the-hood updates, since we designed these new methods so that you don't have to change anything about how you use Optimizely today. These are the changes that you will see:
- Chance to Beat Baseline has been renamed Statistical Significance, and is now a reflection of confidence in the significance of your results. This means that you will now see losers called at 90% significance (they will be highlighted in red) as well as winners (highlighted in green).
- The default statistical significance level for your test to call a winner or loser is now 90%.
- You can now set your own significance threshold from Home > Settings, to reflect your business’s need to balance between testing faster and accepting more false positive results or testing slower and being more accurate.
- You can now see how much longer Optimizely estimates you'll have to wait before your test calls a winner or loser, assuming the observed conversion rates were to hold.
- Visual confidence intervals now display the range of absolute improvement you can expect to see should you implement any given variation.
Are my historical test results still valid?
If you were using a sample size calculator, only looking at your results once, and correctly accounting for the number of experiments, goals and variations you tested, don't worry! On the other hand, checking significance more than once during a test, and making decisions on multiple variations and goals as if you only had one, both increase the chance of making an incorrect decision above the stated significance level. We created the new Stats Engine so you can check your results as often as you like and test many goals and variations at once, and still trust that your statistical risk is correctly represented by what we display.
Which experiments does this affect?
The new Stats Engine framework is only applied to experiments started on or after 1/21/2015. Your historical test results are still calculated using the traditional methods, and at the same significance cutoffs.
Do I still need to use a sample size calculator?
No! You can now start running a test without any sample size in mind. The test automatically zeroes in on an effect as it runs, so the choice of when to stop a test is completely up to you. On the results page Optimizely now displays the estimated number of visitors remaining before the test might be called significant, and our sample size calculator is updated to reflect the Stats Engine framework.
Do I still need to calculate statistical power for my tests?
There is no longer a need to consciously set statistical power. Instead, the specificity of your results depends only on how long you are willing to wait. The sequential test we implemented is a test of power one, which means it will always detect a non-zero effect if you wait for enough samples. Waiting longer on any test gives you more chance to detect a winner or loser, if one exists.
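The "power one" behavior can be illustrated with a small sketch of a sequential test built on a likelihood ratio averaged over possible effect sizes (a mixture sequential probability ratio test). This is not Stats Engine's actual code: the normal approximation for conversion rates, the mixing variance `tau_sq`, the warm-up `min_n`, and the simulated traffic are all assumptions chosen for illustration.

```python
import math
import random

def always_valid_p_values(stream_a, stream_b, tau_sq=0.01, min_n=100):
    """Running p-values for H0: both arms convert at the same rate.

    stream_a / stream_b: equal-length sequences of 0/1 conversions.
    tau_sq: variance of the N(0, tau_sq) mixing prior (hypothetical default).
    min_n: visitors per arm to collect before testing (hypothetical default).
    """
    p_value = 1.0
    conv_a = conv_b = 0
    history = []
    for n, (a, b) in enumerate(zip(stream_a, stream_b), start=1):
        conv_a += a
        conv_b += b
        if n >= min_n:
            rate_a, rate_b = conv_a / n, conv_b / n
            # Normal approximation: sampling variance of the rate difference.
            var_diff = (rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / n
            if var_diff > 0:
                diff = rate_b - rate_a
                # Log of the likelihood ratio averaged over the mixing prior.
                log_lr = (0.5 * math.log(var_diff / (var_diff + tau_sq))
                          + diff ** 2 * tau_sq
                          / (2 * var_diff * (var_diff + tau_sq)))
                # Running minimum: the reported p-value never increases.
                p_value = min(p_value, math.exp(-log_lr))
        history.append(p_value)
    return history

# Simulated traffic with a real effect: 10% vs. 15% conversion rate.
rng = random.Random(42)
n_visitors = 5000
baseline = [int(rng.random() < 0.10) for _ in range(n_visitors)]
variation = [int(rng.random() < 0.15) for _ in range(n_visitors)]
p_values = always_valid_p_values(baseline, variation)
first_call = next((i + 1 for i, p in enumerate(p_values) if p < 0.10), None)
print("visitors per arm at the first 90% significance call:", first_call)
```

Because the effect here is real, the running p-value keeps falling as visitors accumulate and eventually crosses the threshold. A smaller effect would simply take longer to get there.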
How do Stats Engine's changes affect revenue calculations?
We extended Stats Engine to also work on continuous goals. Your revenue-driven A/B tests are now fully sequential and corrected for multiple testing as well!
Why is the new significance level 90%?
Because of the new controls introduced for testing multiple goals and variations, we have switched our test to a two-tailed interpretation. A 90% two-tailed test is mathematically equivalent to our old 1-tailed interpretation where we declared winning variations above 95% Chance to Beat Baseline and losing variations below 5% Chance to Beat Baseline.
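The equivalence is easy to check numerically. The sketch below uses a standard normal approximation (an assumption made for simplicity; the old calculation was a t-test) and hypothetical function names to show that the two decision rules make the same call for any given test statistic.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def one_tailed_call(z):
    """Old rule: winner above 95% Chance to Beat Baseline, loser below 5%."""
    chance_to_beat_baseline = STD_NORMAL.cdf(z)
    if chance_to_beat_baseline > 0.95:
        return "winner"
    if chance_to_beat_baseline < 0.05:
        return "loser"
    return "inconclusive"

def two_tailed_call(z):
    """New rule: call a result at 90% two-tailed significance."""
    p_two_sided = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    if p_two_sided < 0.10:
        return "winner" if z > 0 else "loser"
    return "inconclusive"

# Compare both rules across a range of test statistics.
z_values = [i / 10 for i in range(-30, 31)]
agree = all(one_tailed_call(z) == two_tailed_call(z) for z in z_values)
print("rules agree on every z:", agree)
```

Both rules place the cutoff at the same point, roughly 1.645 standard errors from zero in either direction.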
New FAQs added from our AMA with Leo Pekelis, Optimizely statistician:
How does Stats Engine work with a revenue per visitor goal?
Stats Engine works as intended on revenue per visitor goals. You can look at your results at any time you want and get an accurate assessment of your error rates on winners and losers, as well as confidence intervals on the average revenue per visitor.
In fact, your estimates should be more reliable, sooner, because we now correct for the inherent skewness in calculations based on revenue. One of the changes we made with Stats Engine is that we now compute skew-corrected test statistics for revenue (or any other goal that can potentially take on many values). Significance values are adjusted by a correction factor that estimates skewness from your currently running experiment. Not only does this make results reliable with considerably fewer visitors, but it also results in a more powerful test (when looking at historical revenue tests, the number of conclusive results jumped by a factor of 1.5).
Another feature of skew corrections is that the resulting confidence intervals are no longer symmetric, but naturally adapt in the direction of the skew.
Is it possible to estimate how long a test will run with Stats Engine?
While you are no longer forced to use a sample size calculator, planning out your A/B testing strategy can still be very beneficial.
We’ve replaced the sample size calculator with an estimate of the average number of samples it will take to get a significant result. It still works the same way. You put in estimates of your baseline conversion rate and what effects you are looking to detect and you’ll get a good estimate of how many visitors you can expect to need for significance. (here's the link: https://www.optimizely.com/resources/sample-size-c
Does Stats Engine take into account how traffic is allocated within an experiment?
Stats Engine handles unequal traffic to variations and the baseline without a problem. It calculates significance while taking into account any imbalances in visitor counts.
That said, it’s still possible to run into problems if you decide to change traffic allocation based on the results of your A/B test while it’s running (e.g. putting more traffic into the variation as you see it getting closer to being called a winner). This sort of dynamic traffic allocation is trickier to deal with and is the subject of a family of statistical procedures called bandits. The good news is there are a lot of connections between bandits and sequential testing. This is an area we are very excited to start looking into for Optimizely in the near future.
Does Stats Engine wait for a minimum sample size before a winner is called?
Stats Engine waits for at least 100 visitors and 25 conversions on each variation before we show any results (it’s rare that you’d see significant results at this sample size anyway).
The new system also avoids calling winners very early, so that if you do see a winner with few visitors you know that it has a high chance of really being a winner. Put another way, seeing 90% significance means you have a 10% chance of the variation not really being a winner, whether you have 1000 visitors or 1 million.
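The display threshold mentioned at the start of this answer amounts to a simple gate on the raw counts. A minimal sketch follows; the class and function names are hypothetical, and it assumes the threshold is checked on every arm passed in.

```python
from dataclasses import dataclass

MIN_VISITORS = 100     # from the answer above
MIN_CONVERSIONS = 25   # from the answer above

@dataclass
class VariationCounts:
    name: str
    visitors: int
    conversions: int

def results_can_be_shown(variations):
    """True only if every variation meets both minimums."""
    return all(
        v.visitors >= MIN_VISITORS and v.conversions >= MIN_CONVERSIONS
        for v in variations
    )

early = [VariationCounts("A", 250, 30), VariationCounts("B", 240, 19)]
later = [VariationCounts("A", 900, 110), VariationCounts("B", 880, 95)]
print(results_can_be_shown(early))  # False: B has fewer than 25 conversions
print(results_can_be_shown(later))  # True
```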
How does Stats Engine deal with variations in the visitors to my experiments over time?
With Stats Engine, business cycles and other temporal variation will still have an impact on your results. Waiting several days or through a whole business cycle can be especially useful to get a more accurate representation of the amount of lift generated by a winning variation. To mitigate the effects of this variation, we added a feature to Stats Engine to adapt to an underlying shift in the effect size of your test (for example weekday vs. weekend effects).