Deleting variations from an A/B/C/D/E test changes statistical significance for the remaining variation
We had an A/B/C/D/E treatment running in Optie for 15 days, and since the difference in conversion rates was small, we needed more traffic driven to the leading variation (E) and the original in order to determine the uplift it is creating within a reasonable timeframe.
In order not to lose data already collected we deleted variations B/C/D from the same treatment.
Immediately afterwards, A/E reached statistical significance.
Is this because:
a) Optimizely takes into account the number of variations included in the treatment in some other way than just allocating traffic between them when calculating statistical significance?
b) Coincidence, E was just very close to reaching significance on its own?
c) Anything else?
Out of interest, was there a reason you stopped the test when you did and did not continue it for another 6-7 days?
I feel you would have reached a clearer result with a few more days in a test like this.
As for your questions about the change in the data, I would suggest asking the Optimizely team to take a look. I have seen my own numbers change slightly over time, which may be the case here, as the numbers are not massively different.
In 15 days, one variation gathered approximately 15,000 visitors. This means that for variation E to reach statistical significance with a CR uplift of 3.4%, we would have had to wait 60 more days.
Not optimal, as the opportunity cost of what else we could test during the same period is too high. Hence the decision was made to restructure the treatment.
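For readers who want to sanity-check a runtime estimate like this one, here is a minimal sketch using the classical fixed-horizon sample size formula for comparing two conversion rates. It is not how Stats Engine computes significance (Stats Engine is sequential), and the baseline conversion rate, the 80% power, and the function name are assumptions made for illustration. Only the traffic rate (15,000 visitors in 15 days) and the 3.4% uplift come from the post.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variation(baseline_cr, relative_uplift, alpha=0.10, power=0.80):
    """Classical two-sided two-proportion sample size, per variation."""
    p1 = baseline_cr
    p2 = baseline_cr * (1.0 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


visitors_per_day = 15_000 / 15   # per variation, from the post
baseline_cr = 0.10               # hypothetical, not stated in the thread

needed = visitors_per_variation(baseline_cr, relative_uplift=0.034)
print("visitors needed per variation:", needed)
print("days at current traffic:", ceil(needed / visitors_per_day))
```

The required sample grows with the inverse square of the absolute difference, which is why a small uplift translates into a long runtime.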
Yet, the question remains. Does Optie recalculate stat. significance somehow if variations are removed or was it pure coincidence...
Leho, marketing & tech architect | G+: firstname.lastname@example.org
Based on your stats I assumed you were an eCommerce retailer so the RPV factor would have been one of your key metrics, if not THE key metric of this test.
The 3-in-1 variation had an increase of 18%+ and was close to being considered better than the control so another 6-7 days may have helped to confirm that assumption.
Would you say CR is more important to you than RPV?
Thank you for bringing up RPV. It helped to find the reason.
3 big buys had occurred right before we looked at the data again for variation E, causing the RPV metric to reach statistical significance, while purchase counts remained insignificant.
RPV is a good metric for ecommerce, but it cannot be looked at alone, especially in our case, as wholesalers also buy from the same site and have a huge impact on this metric.
Hopefully Optimizely will one day allow outliers to be excluded automatically.
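To illustrate how a handful of large orders can move RPV while the purchase count barely changes, here is a small sketch with made-up numbers. The order values, the visitor count, and the cap are all hypothetical, and capping order values is just one simple way to dampen outliers offline; it is not an Optimizely feature.

```python
def revenue_per_visitor(order_values, visitors, cap=None):
    """RPV, optionally capping each order value to limit outlier impact."""
    if cap is not None:
        order_values = [min(value, cap) for value in order_values]
    return sum(order_values) / visitors


visitors = 15_000                       # hypothetical
regular_orders = [50.0] * 300           # hypothetical retail orders
wholesale_orders = [5_000.0] * 3        # hypothetical large wholesale orders

rpv_regular = revenue_per_visitor(regular_orders, visitors)
rpv_with_outliers = revenue_per_visitor(regular_orders + wholesale_orders, visitors)
rpv_capped = revenue_per_visitor(regular_orders + wholesale_orders, visitors, cap=500.0)

print(rpv_regular, rpv_with_outliers, rpv_capped)
```

In this made-up case three orders out of 303 double the RPV, while the number of purchases changes by only 1%.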
"With Stats Engine, Optimizely now reports winners and losers with low false discovery rate instead of low false positive rate. As you add goals and variations to your experiment, Optimizely will correct more for false discoveries and become more conservative in calling a winner or loser. While fewer winners and losers are reported overall (we found roughly 20% fewer in our historical database*), an experimenter can implement them with full knowledge of the risk involved."
from here: https://blog.optimizely.com/2015/01/20/statistics-for-the-internet-age-the-story-behind-optimizelys-...
This essentially means that Optimizely now automatically adjusts its false discovery correction when you add or remove goals or variations.
Just thought I would chime in here. Since Optimizely now adjusts significance calculations based on the number of other goals and variations you are testing, deleting variations can impact the significance of the remaining variations. While deleting variations will not always change significance, when it does, it will always increase significance.
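As a rough illustration of why removing variations can only help the remaining one, here is a sketch using the classical Benjamini-Hochberg false discovery rate adjustment. This is an assumption chosen for illustration, not Stats Engine's actual computation, and the p-values below are hypothetical.

```python
def bh_adjusted(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for variations B, C, D, E against the original.
p_values = {"B": 0.60, "C": 0.45, "D": 0.30, "E": 0.04}

with_all = dict(zip(p_values, bh_adjusted(list(p_values.values()))))
only_e = bh_adjusted([p_values["E"]])[0]

print("E adjusted with B, C, D present:", with_all["E"])
print("E adjusted after deleting B, C, D:", only_e)
```

With these numbers, E misses a 90% significance threshold while the other three variations are counted, and passes it once they are deleted, even though E's own data has not changed.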
We generally do not advise this as a method for finding significant variations, however. The reason is that when you delete variations because they were underperforming (for example, had low significance), you raise the chance of finding false positives on the remaining variations by the same amount as the impact of testing multiple variations in the first place.
To help explain why this happens, consider the following scenario. If you are testing two variations against a baseline at a 90% significance threshold, you have a 10% chance of calling a false positive on each variation. This means that the chance of finding a false positive on at least one of the two variations is 19% (1 - 0.9 × 0.9), assuming the two comparisons are independent. If you run both comparisons and then delete one variation when the other looks close to being significant, you are effectively calculating results as though you had a 10% false positive rate while actually incurring a false positive rate of 19%. The contribution to the false positive rate continues to be felt after you delete a variation.
This same thought exercise can be generalized to more than two goals and variations, and it is exactly the sort of behavior that Stats Engine is built to cope with. Stats Engine can’t cope when you delete variations, however, as you are effectively wiping them from its memory.
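The arithmetic in this scenario can be reproduced with a few lines. This is a minimal sketch that assumes the comparisons are independent; the function name is made up for illustration.

```python
def familywise_false_positive_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1.0 - (1.0 - alpha) ** comparisons


# 10% false positive rate per comparison, i.e. a 90% significance threshold.
for k in (1, 2, 4):
    rate = familywise_false_positive_rate(0.10, k)
    print(f"{k} variation(s) vs. baseline: {rate:.4f}")
```

For two variations this gives the 19% from the example above. For four variations against a baseline, as in the original A/B/C/D/E test, the same formula gives roughly 34%.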
It is completely understandable that the impact of corrections for multiple goals and variations on test runtime can be notable for some customers.
For this reason we have excluded both the goal marked as ‘primary’, and the ‘Total Revenue’ goal from multiple testing corrections. This fits with a testing behavior where the primary and total revenue goals are the ones that are most important to an experimenter (they are central to a decision to implement a variation), and then any other goals are treated as secondary (peripheral considerations).
Second, even though a variation may have not yet reached significance, there is valuable information to be gained from the difference interval. In your first screenshots, the 3-in-1 variation has a difference interval that is mostly in the positive range. This interval will always give a reasonable estimate of the best case, worst case, and middle ground effects from implementing a variation, regardless of the significance level.
I would suggest using either or both of these methods in lieu of deleting variations as they give a more accurate representation of the variability in your results, leading to more informed decisions, and fewer unexpected outcomes.
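To make the idea of a difference interval concrete, here is a sketch of a classical fixed-horizon (Wald) interval for the difference between two conversion rates. Stats Engine's intervals are computed differently and will generally be wider; the visitor and conversion counts below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def difference_interval(conv_a, n_a, conv_b, n_b, confidence=0.90):
    """Wald interval for (rate_b - rate_a): returns (low, middle, high)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    std_err = sqrt(p_a * (1.0 - p_a) / n_a + p_b * (1.0 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    diff = p_b - p_a
    return diff - z * std_err, diff, diff + z * std_err


# Hypothetical counts: original vs. one variation, 15,000 visitors each.
low, middle, high = difference_interval(1_500, 15_000, 1_560, 15_000)
print(f"worst case {low:+.4f}, middle {middle:+.4f}, best case {high:+.4f}")
```

With these numbers the interval still includes zero, so the result is not significant, but most of it lies in the positive range, which is the kind of information the interval gives you before significance is reached.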
We talk more about the uses of confidence intervals and recommended approach to A/B Testing with Optimizely in a recent webinar ( https://community.optimizely.com/t5/Presentations/Webinar-recording-Stats-Engine-Q-amp-A-webinar-rec... ), especially starting at around 17:15.
We are also working on even more best practices and statistics content for our upcoming Opticon. Please do check it out if you are planning to attend. And if not, be on the lookout for materials in the months to come!
Statistician at Optimizely