Multiple goals and variations impact on false discovery rates
I recently stumbled upon this piece of information:
"With Stats Engine, Optimizely now reports winners and losers with low false discovery rate instead of low false positive rate. As you add goals and variations to your experiment, Optimizely will correct more for false discoveries and become more conservative in calling a winner or loser. While fewer winners and losers are reported overall (we found roughly 20% fewer in our historical database*), an experimenter can implement them with full knowledge of the risk involved." (from here: https://blog.optimizely.com/2015/01/20/statistics-for-the-internet-age-the-story-behind-optimizelys-...
The problem is that when I run treatments I have a couple of key KPIs I look at, like revenue and purchase counts. At the same time we monitor other goals as well, such as click-through rates, visits to other pages, and so forth, without ever really making a decision based on them.
Now, does this mean that just by adding those additional goals to be tracked, Optimizely will automatically include them in its false discovery rate calculation and "become more conservative"?
And hence that it will take longer for results to reach a false discovery rate of 5%?
Short answer: Yes, multiple goals are adjusted for in Optimizely, though goals marked as 'primary' and 'Total Revenue' are excluded from multiple testing corrections. One of Optimizely's statisticians goes into more detail in this thread.
One thing worth knowing: adding additional comparisons (i.e., more goals or variations) can actually make the adjustment less conservative, depending upon the effect size (and thus the resulting p-value). Here's an example. The first column shows the unadjusted p-values from five comparisons; two results have p-values <= 0.05 (i.e., 95% statistical significance). In the middle column, the p-values have been adjusted on only the first four comparisons (for the purposes of this example, we are effectively dropping the fifth comparison as if it didn't exist) to accommodate the False Discovery Rate, using the same method Optimizely uses; none of the results are "statistically significant". The right column shows the p-values adjusted to accommodate the False Discovery Rate across all five comparisons; two results are now "statistically significant".
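Since the original table may not be visible here, the same mechanics can be reproduced with a standard Benjamini–Hochberg step-up adjustment, the classic FDR correction in the family Optimizely's blog describes. The p-values below are made up for illustration; they are not the numbers from the original table:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values."""
    m = len(pvals)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, taking the running minimum
    # of p * m / rank so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for variations 1-5; variation 5 is the
# strongest result, variation 3 the weakest.
raw = [0.02, 0.06, 0.5, 0.3, 0.004]

print(bh_adjust(raw))      # all five comparisons: two results adjust to <= 0.05
print(bh_adjust(raw[:4]))  # first four only: no result adjusts to <= 0.05
```

With all five comparisons in the pool, the very small fifth p-value lowers the adjusted values of the others; with only the first four, none clears the 0.05 threshold, mirroring the middle vs. right columns above.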
Great response, very down to the point and with a superb example.
For the sake of simplicity lets call the variations 1,2,3,4,5 (from top down)
Now, do I get you right that, looking at the FDR adjustment columns, you're drawing out the comparison between running a treatment with 4 variations and one with 5 variations, and what the p-values would be, everything else held constant?
But what if I have 5 variations running for a 4-week period and now delete one of them (the 5th one, for the sake of the example)? Do the p-values then change to the numbers you show in the FDR adjustment (4 comparisons) column?
So essentially, by deleting a variation I end up with less significant p-values and have to wait for more data to come in?
If you could, it would give a really comprehensive overview to post the same table, but instead of deleting the 5th variation (which is highly unlikely in the real world, as it is the most significant of the 5), delete the one furthest from reaching statistical significance, the 3rd variation.
On your first question, this is probably a better way to visualize my original example. Consider two separate experiments – in Experiment 2, variations 1, 2, 3, 4 have the exact same unadjusted p-values as in Experiment 1 – the only difference is that Experiment 2 has a fifth variation. The FDR-adjusted p-values shown in Experiment 1 are what you would see if you deleted Variation 5 from Experiment 2.
On the topic of likely vs. unlikely real world scenarios for deleting variations from an experiment (based upon what shows / doesn't show as "significant"): Anytime you add a variation you increase the likelihood of accepting a false positive. By deleting a variation you aren’t able to properly control your experiment’s false positive rate (since you’ve deleted the data required to do so!), so this is not a recommended practice.
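To see why each added comparison raises the chance of a false positive: if every variation is truly null, a single uncorrected comparison has a 5% chance of looking "significant", but the chance that at least one of m comparisons does is 1 − 0.95^m. A quick simulation (purely illustrative, not Optimizely's implementation):

```python
import random

random.seed(0)

def any_false_positive_rate(m, trials=20000, alpha=0.05):
    """Fraction of simulated experiments, with all m comparisons truly
    null (p-values Uniform(0,1)), in which at least one comparison
    comes out 'significant' at alpha with no correction applied."""
    hits = 0
    for _ in range(trials):
        if any(random.random() <= alpha for _ in range(m)):
            hits += 1
    return hits / trials

# The uncorrected any-false-positive rate climbs with m,
# tracking 1 - 0.95 ** m (about 0.05, 0.23, 0.40).
for m in (1, 5, 10):
    print(m, round(any_false_positive_rate(m), 3))
```

This is exactly the inflation an FDR correction is there to control, and why deleting the data for a variation after the fact undermines that control.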
As requested, here's what deleting Variation 3 from the original example does to the adjustment:
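In case that table doesn't render either, the effect can be sketched with the same made-up p-values as before under a Benjamini–Hochberg adjustment. Dropping the comparison furthest from significance removes the largest p-value from the pool, so m shrinks by one while every remaining rank stays the same, and each remaining adjusted p-value can only get smaller or stay put:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values for variations 1-5, as before; variation 3
# (p = 0.5) is the one furthest from significance.
raw = [0.02, 0.06, 0.5, 0.3, 0.004]
without_v3 = raw[:2] + raw[3:]

print(bh_adjust(raw))         # two comparisons adjust to <= 0.05
print(bh_adjust(without_v3))  # still two <= 0.05, each value no larger than before
```

So in this direction the adjustment gets less conservative, the mirror image of what happens when the most significant variation is deleted.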