Stats engine results vs classical stats results question
From what I understand, the stats engine was introduced in order to make calling a test faster and more reliable. I'm interested in how that works given I have a client test with the following results on conversions from home page visits to purchases:
12,142 visits, 329 purchases
12,042 visits, 421 purchases.
That's a 29% lift, the conversion-rate lines suggest the data isn't noisy, and classical statistics says this is a solid win, e.g. using http://abtestguide.com/calc/
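For concreteness, the classical calculation behind a calculator like that can be sketched in a few lines of Python. This is a pooled two-proportion z-test on the thread's numbers, my own illustration rather than anything from Optimizely:

```python
from math import sqrt, erf

# Thread's numbers: control vs. variation
n1, c1 = 12142, 329   # control: visits, purchases
n2, c2 = 12042, 421   # variation: visits, purchases

p1, p2 = c1 / n1, c2 / n2
lift = (p2 - p1) / p1                      # relative lift

# Pooled two-proportion z-test (what most classical calculators run)
p_pool = (c1 + c2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided

print(f"lift = {lift:.1%}, z = {z:.2f}, p = {p_value:.4f}")
# → lift = 29.0%, z = 3.53, p = 0.0004
```

A fixed-horizon test run once at exactly this sample size would indeed call this a clear win, which is why the classical calculators disagree with Stats Engine here.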
The stats engine says we're at 66% statistical significance, and there are ~2400 visitors remaining (I have found that number to be completely unreliable).
Maybe I'm an idiot, but all data from the "old way" of doing things points to this being a good win... except data from the stats engine.
Anyone have an explanation for this?
Hey @CROmetrics - There was actually a very similar question posted a while ago in the community, and Optimizely's Statistician, @Leo, responded. Check it out here: https://community.optimizely.com/t5/Using-Optimize
Can you let us know if that response answers your question or if you are still looking for more specific details? We'd be happy to help if so.
I have to admit at this point that I am also struggling to get my head around the new stats engine in Optimizely and am currently only working on my own calculations for verifying test data.
From what I have seen it actually makes calling a winner in a test a longer process and as @CROmetrics has pointed out, the number of visitors remaining just appears to be wrong all the time!
Here's my understanding:
- The old results page was only reliable if you looked at it after a test had concluded (with a significant sample size). If you checked it sooner than that, the less reliable result would show anyway, which led people to jump to conclusions too early based on insufficient information.
- The new stats engine addresses the reality that people are going to check results frequently, and as a result is more conservative when it comes to declaring winners or losers.
Therefore the result you see at any time is more reliable. This allows you to make a decision before a significant result is reached, if you feel like it.
For example, if you heard feedback from users that they like Variation B better, and Variation B fits your long-term vision, and it's leading by +20% with 40% statistical significance, you can choose to go with Variation B even before it reaches 100% significance.
The issue isn't that the new stats engine takes too long to call a winner, it's that the old version called a winner too soon.
Thanks for posting your questions. I think they are important ones that a few customers might be having.
Also, thank you @greg for posting clarifying comments!
In answer to your first question, there's one particular sentence of Greg's reply that I would like to highlight:
“The issue isn't that the new stats engine takes too long to call a winner, it's that the old version called a winner too soon.”
The sort of classical statistics you are referring to are designed to be used in a very specific way. That is, pick one sample size in advance, preferably using a sample size calculator. Record your results at the sample size you chose, and only at that sample size. And test only 1 variation on 1 goal at a time.
Using classical statistics in any way that departs substantially from this prescription, such as peeking at your results more than once, or testing multiple variations or goals simultaneously, invalidates the assumptions of classical statistics, and therefore the results you get from it as well.
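To see how much damage peeking alone does, here is a small Monte Carlo sketch with illustrative parameters of my own choosing. It runs A/A tests, where there is no true difference between the arms, applies a naive pooled z-test at every peek, and counts how often at least one peek crosses the nominal 5% threshold:

```python
import random
from math import sqrt

random.seed(0)

def peeking_false_positive_rate(trials=500, visitors=2000, peek_every=200,
                                conv_rate=0.05, z_crit=1.96):
    """Fraction of A/A tests (no true difference between the arms) that
    look 'significant' at some peek, using a pooled z-test at each look."""
    hits = 0
    for _ in range(trials):
        c1 = c2 = 0
        for n in range(1, visitors + 1):
            c1 += random.random() < conv_rate
            c2 += random.random() < conv_rate
            if n % peek_every == 0:
                pooled = (c1 + c2) / (2 * n)
                if pooled in (0.0, 1.0):
                    continue  # no variance yet; skip this peek
                se = sqrt(pooled * (1 - pooled) * (2 / n))
                if abs(c1 - c2) / n / se > z_crit:
                    hits += 1   # experimenter stops at the first "winner"
                    break
    return hits / trials

rate = peeking_false_positive_rate()
print(f"A/A 'significance' rate with 10 peeks: {rate:.1%}")
```

With ten looks, the naive procedure flags far more than 5% of these no-difference tests, which is exactly the kind of error a sequential method has to control for.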
Stats Engine is designed to protect you from sources of statistical error that classical statistics does not account for outright.
The unaccounted-for source of error that I want to address in this post is the error that comes from testing multiple goals and variations at once. When you call winners and losers at a .1 alpha level, you would expect to see about two significant results out of 22 A/B tests by chance alone, even if none of the 22 tests had a true difference between variation and baseline. I mention this because following up on the experiment ID of your client's test revealed that it was one of twenty-two concurrent experiments.
While it remains fairly unlikely that the A/B test you reference has no true difference between variation and baseline, it's not quite as unlikely as it may seem at first glance once you account for the 21 other experiments running at the same time.
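The point about 22 concurrent experiments is easy to quantify. Treating the tests as independent (an idealization for illustration; experiments sharing traffic are not truly independent):

```python
# If you call winners at alpha = 0.1 and run 22 tests where nothing is
# actually different, chance alone still hands you "significant" results.
alpha, num_tests = 0.1, 22

expected_false_positives = num_tests * alpha        # 22 * 0.1 = 2.2
p_at_least_one = 1 - (1 - alpha) ** num_tests       # assumes independence

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"chance of at least one:   {p_at_least_one:.0%}")  # → 90%
```

In other words, with 22 true-null tests you are nearly guaranteed at least one spurious "winner", and you expect about two of them.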
Finally, we are looking to make power improvements to Stats Engine for low-sample-size tests with roughly 10,000 visitors or fewer. These changes will have the most impact for low conversion rates, as in your client's case. We first concentrated on building Stats Engine to protect our users from hidden sources of error, and to reliably discriminate significant results at higher visitor counts and higher baseline conversion rates. In the coming weeks we will introduce an optimization that improves finite-sample power at low visitor counts and low baseline conversion rates. Note that this does not affect most cases that reached a high level of significance, so currently significant results will not be affected.
I’d also like to answer your and adzeds’ questions regarding the visitors remaining number.
The visitors remaining number we display in our dashboard is defined as the number of visitors you will need to reach a significant result, if your baseline and variation conversion rates stay exactly where they are. Understandably, this is not a very realistic assumption. Observed conversion rates change by the day, hour and even minute.
The way we intended you to interpret visitors remaining is as a middle ground for how many more visitors you'll need to see significance. If the magnitude of your improvement decreases from where it is now, then you'll need more visitors than is displayed. And if the magnitude of your improvement increases, you'll need fewer. We are currently working on an update to the results page that will allow you to get a better sense of how visitors remaining would change if your conversion rates materially change from where they are now. Stay tuned!
For now, one workaround is to compare visitors remaining + the number of visitors you’ve seen so far in your test to the output from our sample size calculator and then model different conversion rate scenarios through the calculator. Keep in mind though, results won’t match up exactly, since the in-product visitors remaining calculation takes into account how much evidence you’ve seen so far in your experiment, and how many other goals and variations you are currently testing.
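As a rough stand-in for the sample size calculator, here is the standard fixed-horizon formula for a two-proportion test. This is a sketch with hard-coded z-scores and hypothetical inputs, not Optimizely's exact implementation (which, as noted above, also accounts for evidence seen so far and the number of goals and variations):

```python
from math import sqrt, ceil

def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.8):
    """Classical fixed-horizon sample size per variation for a
    two-proportion test. mde is the relative lift you want to detect."""
    # z-scores for common alpha/power levels; a lookup avoids needing scipy
    z_alpha = {0.10: 1.645, 0.05: 1.960, 0.01: 2.576}[alpha]  # two-sided
    z_beta = {0.80: 0.842, 0.90: 1.282}[power]
    p1 = baseline
    p2 = baseline * (1 + mde)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * var / (p2 - p1) ** 2
    return ceil(n)

# e.g. this thread's numbers: ~2.7% baseline, 29% relative lift
n = sample_size_per_arm(0.027, 0.29)
print(f"visitors needed per arm: {n}")
```

For these inputs the classical requirement comes out well under the ~12,000 visitors per arm already collected in the thread's test, which is consistent with fixed-horizon calculators declaring it a win.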
I hope this information was helpful to both of you, and look forward to answering any further questions you may have.
Statistician at Optimizely
@Leo The client's test which started this thread has only 5 tests running now, not 20+, and several of those are not even tests; they're us using Optimizely as a CMS to inject scripts or control a page. Not a great use of Optimizely, but sometimes a necessary evil, and something I know Optimizely themselves do, so I don't feel too bad about it.
In any case, improving the reporting interface would be helpful, since my clients are universally smart enough to run basic classical stats and then look at their results page, forcing me to try to explain all this stuff, which is not easy. However, this explanation is helpful and a step in the right direction.