1-tailed vs. 2-tailed tests
Although I know the theory behind both tests, I was wondering: 2-tailed tests are more reliable because the results are more precise. You will have to gather more data until you can see what is going on if you keep a significance level of 95%. This is because the confidence interval is getting bigger for our hypothesis (instead of 95% we now have 97.5% on both sides).
When we set the significance level to 90% we should have the same range as in a one-tailed test, right? At least for one side.
Now I still don't understand why 2-tailed testing is better. It tells you if a negative uplift is significant, but who cares? Aren't we focused on getting improvements, so that we only care about the significance of positive uplifts?
Hopefully I can clarify a bit about how our Stats Engine works to help explain why we moved to using 2-tailed tests.
- 2-tailed tests aren't more precise than 1-tailed tests; they simply measure the chance that a result is statistically significant in either the positive or the negative direction. The p-values of a 1-tailed and a 2-tailed test are computed from the same data points; it's just a question of whether or not a test result being worse than the baseline is important to measure.
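- Thus, at the same significance level, you wouldn't have the same range on one side of the curve in a 2-tailed test as in a 1-tailed test. With 90% significance, the range on the positive side of the curve in a 2-tailed test is 5%, but in a 1-tailed test it would be 10%. (That 5% on the positive side is the same as in a 1-tailed test run at 95% significance.)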
- You are correct that a 2-tailed test with a 95% significance level has 2.5% on each side of the curve. Further, the confidence level attached to the interval stays at 95%. What changes as you collect more data (i.e. as the test gains power and with it a stronger ability to detect an actual difference) is that the confidence interval gets smaller. That is, we are able to say with 95% confidence that your results, if you rolled a variation out to your entire population of visitors, would fall within a much narrower range of possibilities.
- When you test, yes, it is important to verify that a positive lift is significant. However, we argue it is equally important to verify whether a negative lift is significant. False discovery rate control allows us to make statements that are more in line with our customers' needs: what is the chance I might see different (or worse) results if I make this change for all my customers? This could happen if you implement a winner that is actually inconclusive, or if you implement an inconclusive result that is actually a loser, or even if you refrain from rolling out an inconclusive result that is actually a winner. False positive rate control only covers you when rolling out a winner that is actually inconclusive. It does not cover you for rolling out an inconclusive result that is actually a loser, or for not rolling out an inconclusive result that is actually a winner, but Stats Engine does.
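To make the tail arithmetic in the points above concrete, here is a minimal Python sketch. It is not how Stats Engine computes anything; it assumes a plain normal approximation and a fixed sample size, and the function names (`upper_tail_area`, `critical_z`, `interval_half_width`) are hypothetical names chosen only for this illustration.

```python
import math
from statistics import NormalDist


def upper_tail_area(confidence, tails):
    """Share of the rejection region that sits on the positive side of the curve."""
    return (1.0 - confidence) / tails


def critical_z(confidence, tails):
    """Critical value of a standard normal for a 1-tailed or 2-tailed test."""
    return NormalDist().inv_cdf(1.0 - upper_tail_area(confidence, tails))


def interval_half_width(rate, visitors, confidence=0.95):
    """Half-width of a 2-sided normal-approximation interval for one conversion rate.

    Assumption: fixed sample size and a simple binomial standard error.
    """
    z = critical_z(confidence, tails=2)
    return z * math.sqrt(rate * (1.0 - rate) / visitors)


if __name__ == "__main__":
    # Same 90% level: 5% on the positive side (2-tailed) vs. 10% (1-tailed).
    print(upper_tail_area(0.90, tails=2), upper_tail_area(0.90, tails=1))
    # A 2-tailed test at 90% and a 1-tailed test at 95% share one cutoff (about 1.645).
    print(critical_z(0.90, tails=2), critical_z(0.95, tails=1))
    # The interval narrows as visitors accumulate; the confidence level stays at 95%.
    for n in (1_000, 10_000, 100_000):
        print(n, interval_half_width(0.10, n))
```

Under these assumptions, ten times as many visitors shrinks the interval by a factor of about 3.16 (the square root of 10), which is the "narrower range of possibilities" described above.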
Do these details help? Our goal is to ensure you have a thorough understanding of how our Stats Engine empowers you to feel strongly about the results of your tests and how those translate to lift for your business. If any of these concepts are still unclear, please do let me know and we can engage in a deeper 1-1 chat. Thanks for these great questions!
Solutions Architect | Optimizely, Inc.
Thank you for the answer. A 1-1 chat would definitely be helpful because I still don't understand why there is no interval for the control version in your Stats Engine.
There isn't an interval for the control version because our difference intervals are always expressed relative to the control. So, unlike a variance or standard error, which measures the confidence in the observed mean of a single population, our difference interval is meant to answer two questions: how confident are we in the variation population's mean (how wide the interval is), and is it actually different from the control (whether or not it's turned green or red and the bar has no intersection with the middle line)?
If there's no difference observed between the variation and the control, then the observed mean for the variation (and the range expressed by the difference interval) is essentially telling you the range for the control as well.
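As a rough illustration of an interval expressed relative to the control, here is a minimal Python sketch. It is an assumption-laden stand-in, not the Stats Engine calculation: it uses a textbook fixed-sample normal approximation for the absolute difference between two conversion rates, and the names `difference_interval` and `verdict` are hypothetical.

```python
import math
from statistics import NormalDist


def difference_interval(control_conv, control_n, variation_conv, variation_n,
                        confidence=0.95):
    """2-sided interval for (variation rate - control rate).

    Assumption: classical normal approximation with a fixed sample size.
    """
    p_control = control_conv / control_n
    p_variation = variation_conv / variation_n
    diff = p_variation - p_control
    # Uncertainty from both groups is folded into this single interval,
    # which is why no separate interval is drawn for the control.
    se = math.sqrt(p_control * (1.0 - p_control) / control_n
                   + p_variation * (1.0 - p_variation) / variation_n)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return diff - z * se, diff + z * se


def verdict(low, high):
    """Label the result by whether the interval crosses the zero (middle) line."""
    if low > 0:
        return "winner"        # entirely above zero (green)
    if high < 0:
        return "loser"         # entirely below zero (red)
    return "inconclusive"      # interval intersects zero


if __name__ == "__main__":
    low, high = difference_interval(100, 1000, 150, 1000)
    print(low, high, verdict(low, high))
```

The control's own sampling error is already part of the standard error of the difference, so the single interval answers both questions at once: how wide the plausible range is, and whether it excludes zero.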
Does this make sense? What other questions do you have?
I do recommend reading this article: https://help.optimizely.com/hc/en-us/articles/200039895#fdr_control