
Let's talk about Single Tailed vs Double Tailed

cubelodyte 08-15-14
Accepted Solution

Let's talk about Single Tailed vs Double Tailed

My boss just forwarded an article link to me. It came in a solicitation email from an Optimizely competitor. To quote my boss,  "While I hate marketing ploys like this, it is an interesting read."

 

The article discusses what it considers to be a significant flaw in Optimizely: the exclusive use of "Single-Tailed" results analysis. The article goes on to use phrases like "Fairy Dust", "Short-Term Bias" and "Regression to the Mean". The author also describes a situation where his testing showed a strong and consistent bias towards the variation.

 

In truth, I've not come across this particular phenomenon. In fact, over the last 2 years, I have occasionally used A/A testing to investigate the possibility that Optimizely results might be weighted in favor of the Original baseline.

 

In our shop, we use data tagging attached to nearly every conceivable event on the site (approx. 14 million events/day). Since we are often more interested in total events in our KPIs than the binary (de-duped) results provided by Optimizely, our primary decision-making process is weighted toward analysis of our own event data.

 

Topics for Discussion:

All of this leads me to wonder how other shops are setting up their strategies.

 

  • Do you use Optimizely results exclusively?
  • If so, have you found an unnatural bias towards winning variations?
  • If not, how do you supplement your findings?

 

Bonus Question for Optimizely

If you have a response to this article, I would love to hear it in this thread as well.

Scott Ehly
Manager of Site Optimization
sehly@rentpath.com

'The single biggest problem with communication is the illusion that it has taken place.' - George Bernard Shaw
adzeds 08-18-14
 

Re: Let's talk about Single Tailed vs Double Tailed

Thanks for sharing the link to the article. I have not read this one before.

Unfortunately I don't like reading these sorts of articles as they annoy me.

I can understand where the author is trying to go with the post, but the outcome will always be the same no matter which method you choose: it will always come down to how informed and experienced the tester is.

I have to admit that I do not like the way Optimizely announces that one of your variations is 'winning', as to a new or inexperienced tester this reads very much as though the test has delivered a 'winner' - I think that can be improved in Optimizely.

All tests have to be validated by the individual running the test, no matter which method is selected, so that will always be the most obvious point of failure in a test... That is why you have CRO Consultants/Experts.

I think that education of testers is an important part of the CRO industry and Optimizely is tackling that with areas such as their knowledge base and this community. It would be interesting to see what others thought of the post.
David Shaw
Level 11
MartijnSch 08-18-14
 

Re: Let's talk about Single Tailed vs Double Tailed

Can't agree more with the response above. Optimizely could do a better job of integrating the sample size calculator into their toolset. That would make it more obvious that in some cases they're not using 99.99% confidence, and it would probably end up incentivizing a higher-tier package, since their pricing is based on limits on the number of visitors.
DavidWalsh 08-19-14
 

Re: Let's talk about Single Tailed vs Double Tailed

Hi @cubelodyte,

 

I’m a statistician here at Optimizely. Thanks for bringing this up—it raises an interesting discussion about the best way to report statistical significance. It’s important to point out, though, that the article doesn’t paint a complete picture of how Optimizely evaluates success or failure of a variation.

 

The first distinction is that Optimizely actually uses two 1-tailed tests in evaluating variation results. This methodology allows us to report one of three outcomes for the variation: Winner, Loser, or Inconclusive. The post that follows should help shed some light on the differences between 1-tailed and 2-tailed testing and why, in practice, the conclusions you draw from experiments will be quite similar.

 

If you’re interested in the details, read on. Otherwise, here’s the TL;DR:

 

  • Optimizely actually uses two 1-tailed tests, not one.
  • There is no mathematical difference between a 2-tailed test at 95% confidence and two 1-tailed tests at 97.5% confidence.
  • There is a difference in the way you describe error, and we believe we define error in a way that is most natural within the context of A/B testing.
  • You can achieve the same result as a 2-tailed test at 95% confidence in Optimizely by requiring the Chance to Beat Baseline to exceed 97.5%. 
  • We’re working on some exciting enhancements to our methodologies to make results even easier to interpret and more meaningfully actionable for those with no formal Statistics backgrounds. Stay tuned!

 

First of all, whether you are running 1-tailed or 2-tailed tests, you start by computing the same Chance to Beat Baseline (p-value = 1 - Chance to Beat Baseline). This is a simple measure of how strongly the data indicates that the variation is better than the baseline. You can quote this Chance to Beat Baseline within the context of either a 1-tailed test or a 2-tailed test. The difference between the two depends on:

 

  1. The words you use to report your results (e.g. “winner,” “loser,” and “inconclusive”)
  2. The way you define “error” in your organization
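Before getting into those two points, here is a minimal sketch of the shared starting point: a one-sided two-proportion z-test of the sort that produces a Chance to Beat Baseline figure. The function name and example numbers are made up for illustration; this is not Optimizely's exact calculation.

```python
from math import sqrt
from scipy.stats import norm

def chance_to_beat_baseline(conv_base, n_base, conv_var, n_var):
    """One-sided two-proportion z-test: rough stand-in for Chance to Beat
    Baseline. Illustrative only -- not Optimizely's exact calculation."""
    p_base, p_var = conv_base / n_base, conv_var / n_var
    # Pooled standard error under the null hypothesis of equal conversion rates
    p_pool = (conv_base + conv_var) / (n_base + n_var)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_base + 1 / n_var))
    z = (p_var - p_base) / se
    return norm.cdf(z)  # one-sided p-value = 1 - Chance to Beat Baseline

# Example: 10,000 visitors per branch, 500 vs. 550 conversions
print(chance_to_beat_baseline(500, 10000, 550, 10000))  # ~0.94
```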

 

The words you use to report your results
As mentioned above, Optimizely uses two 1-tailed tests in evaluating variation results. The first 1-tailed test answers the question: “Is this variation a winner or not?” The second 1-tailed test answers the question: “Is this variation a loser or not?” It is by combining the answers to these two questions that we get to one of the three possible conclusions: Winner, Loser, or Inconclusive.
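As a toy illustration of how those two questions combine, the snippet below turns a Chance to Beat Baseline value (for example, one from the hypothetical helper sketched earlier) into one of the three labels. The 5% threshold is just an example setting, not Optimizely's internal logic.

```python
def classify(ctbb, alpha=0.05):
    """Combine the two 1-tailed questions into one of three labels.
    ctbb is a Chance to Beat Baseline value (e.g. from the sketch above)."""
    if ctbb >= 1 - alpha:   # 1-tailed test #1: "is this variation a winner?"
        return "Winner"
    if ctbb <= alpha:       # 1-tailed test #2: "is this variation a loser?"
        return "Loser"
    return "Inconclusive"
```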

 

A 2-tailed test, on the other hand, answers the question: “Is this variation different from or the same as the control?” Within the context of A/B testing, most testing platforms that use 2-tailed tests will use a positive test statistic to indicate “Winner” and a negative test statistic to indicate “Loser.” While this may not follow statistical conventions, it’s generally okay in practice. But once a testing platform reports results using the words “winner” and “loser”, they are actually reporting results as though they were running two 1-tailed tests. That’s why the difference between two 1-tailed tests and a single 2-tailed test is mostly semantic.

 

I also think it might be helpful to point out that, mathematically, any 2-tailed test can be interpreted as two 1-tailed tests and vice-versa. A 2-tailed test at 95% confidence (5% error) is equivalent to two 1-tailed tests at 97.5% confidence (2.5% error + 2.5% error = 5% error). In the case of Optimizely, two 1-tailed tests at 5% are equivalent to a single 2-tailed test at 10%.
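A quick way to see the equivalence numerically: the critical value of a 2-tailed test at 95% confidence is the same as that of a single 1-tailed test at 97.5% confidence. This is just a textbook check, nothing specific to Optimizely.

```python
from scipy.stats import norm

# Critical z for a 2-tailed test at 95% confidence (5% error, split 2.5% per tail)
print(norm.ppf(1 - 0.05 / 2))   # 1.95996...
# Critical z for a single 1-tailed test at 97.5% confidence (2.5% error in one tail)
print(norm.ppf(1 - 0.025))      # 1.95996... -- the same cutoff
```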

 

In summary, if you conclude that your variation is better than, worse than, or not materially different from the original, you are—in the most technical sense—reporting your results as though you ran two 1-tailed tests, even if you had originally set out to conduct a 2-tailed test.

 

The way you define “error” in your organization
So why does the same procedure seem to give you “95%” confidence for two 1-tailed tests and “90%” for the single 2-tailed test? It comes down to how you define an error, and thinking about this emphasizes why using the right terminology can be important.

 

Because a 2-tailed test officially regards all differences equally, it defines an error as reporting a difference when there actually is none. In other words, in a 2-tailed test at the 90% level, we will tolerate reporting a conclusive difference when there is none in 10% of experiments. On the other hand, for two 1-tailed tests an error is defined as reporting a variation as a winner when it is actually a loser, or vice versa. We believe this definition of error is the more natural one within the context of A/B testing.

 

You may decide that you are more comfortable with a significance level corresponding to a 2-tailed test at 95%—or equivalently, two 1-tailed tests at 97.5%. The right significance level is up to you, and it comes down to making a trade-off between limiting errors and getting fast results. To make that trade-off, though, you need to ensure the error rate you’re focusing on is the one you really care about.

 

It’s not that Optimizely can’t detect results at a 97.5% significance level, but that we chose to control what we believe to be the right description of error at a 95% significance level. That’s why we always show customers our Chance to Beat Baseline—so that you can choose the significance level with which you’re most comfortable. To do so, simply use a sample size calculator and set the “Statistical Significance” level to the appropriate threshold.
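If you want to see how that threshold interacts with sample size, a rough textbook-style calculation for a two-proportion test looks like the sketch below. The function and its defaults are illustrative assumptions, not the formula behind Optimizely's sample size calculator.

```python
from scipy.stats import norm

def visitors_per_variation(p_base, relative_lift, significance=0.95, power=0.80):
    """Rough two-proportion sample size estimate (normal approximation).
    Illustrative only -- not Optimizely's calculator."""
    p_var = p_base * (1 + relative_lift)
    z_alpha = norm.ppf(1 - (1 - significance) / 2)   # two-sided critical value
    z_beta = norm.ppf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return (z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2

# 5% baseline conversion rate, 10% relative lift to detect
print(round(visitors_per_variation(0.05, 0.10, significance=0.95)))   # ~31,000
print(round(visitors_per_variation(0.05, 0.10, significance=0.975)))  # larger sample needed
```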

 

Ultimately, we believe Statistics is a powerful tool, but it’s also a complex one that can sometimes be misinterpreted when attempting to take action. At Optimizely, our job is to allow customers to make statistically rigorous actions without a team of Statisticians. We are currently working on enhancements to make it even easier for customers to make informed and accurate decisions. Stay tuned!

 

Thanks,

 

David Walsh
Statistician at Optimizely
Statistics PhD Candidate, Stanford University

Optimizely
cubelodyte 08-19-14
 

Re: Let's talk about Single Tailed vs Double Tailed

David, 

Thank you for the detailed response. I've marked it as "Accept as solution" because in my mind, it definitely puts to rest any doubt there may have been over reporting methodology. 

 

That said, I was never in great doubt. My reason for posting was to prompt awareness, which I believe has been accomplished.

 

Beyond that, I'd still be interested in hearing from other testers about how they reach their conclusions. 

 

For my own part, I have a simpler, albeit less academic, method for determining convergence (or at least reporting it). In my findings, statistical significance occurs when the graph lines of cumulative conversion run parallel. Rise or fall, if one variation remains steadily above or below its competitor, I'm ready to call it.
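In code terms, that eyeball test amounts to something like the sketch below, assuming you log daily visitors and conversions yourself; the helper names and the seven-day window are arbitrary choices for illustration.

```python
import numpy as np

def cumulative_rates(daily_conversions, daily_visitors):
    """Cumulative conversion rate after each day of the experiment."""
    return np.cumsum(daily_conversions) / np.cumsum(daily_visitors)

def lines_run_parallel(rates_a, rates_b, last_n=7, tolerance=0.001):
    """Has the gap between the two cumulative curves stopped moving,
    with one variation staying consistently above the other?"""
    gap = np.asarray(rates_a[-last_n:]) - np.asarray(rates_b[-last_n:])
    stable = (gap.max() - gap.min()) < tolerance
    one_side = (gap > 0).all() or (gap < 0).all()
    return stable and one_side
```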

 

In addition to this, I also perform segmentation on known outliers. Early in an experiment, this segmentation will provide results that differ widely from the unfiltered view. In time, however, we reach a point in an experiment where the analysis of the unsegmented data begins to look just like the segmented data. At this point, actual conversion percentages may differ, but the margin of difference does not. In other words, the graphs look identical, they're just higher or lower on the vertical axis. This, too, indicates convergence.

 

These two methods combine into a very visual way for me to pass findings up the channel to stakeholders who are more interested in the bottom line than in a lesson in statistics. I'm very eager to see what you've got in the works to perform this level of simplification for me.

Scott Ehly
Manager of Site Optimization
sehly@rentpath.com

'The single biggest problem with communication is the illusion that it has taken place.' - George Bernard Shaw
Leo 01-23-15
 

Re: Let's talk about Single Tailed vs Double Tailed

Hello everyone,

 

Remember when we told you to stay tuned a few months ago? Well three days ago we kept our promise of making it ‘even easier for customers to make informed and accurate decisions’ by launching Stats Engine! Since one of the changes affects the response David gave above, I’m updating this topic to help keep everyone abreast of the updates.

 

TL;DR

  • We have introduced false discovery rate control to fully inform users of sources of error that come from testing many goals and variations at once.
  • Because false discovery rate control maps to a much more user friendly two-tailed interpretation, we are switching to the mathematically equivalent procedure of running a two-tailed hypothesis test.
  • Chance to beat baseline has been replaced with ‘statistical significance’, and both winners and losers are announced when statistical significance is high (by default, above 90%).
  • Significance is interpreted as evidence that there is a difference between variation and baseline, regardless of whether the variation is a winner or loser. Winners are positive differences, and losers are negative differences.
  • The significance cutoff for calling winners and losers is now adjustable in your project-level settings.

 

As David pointed out in his post, running a 2-tailed test at 95% significance and calling a winner / loser if the difference in conversion rates between a variation and baseline is positive / negative is mathematically equivalent to running two 1-tailed tests at 97.5% significance. You’ll make the same conclusions in the same cases and have identical rates of error.

 

So why did we make this change and why was it materially important for our new Stats Engine?

 

Stats Engine is designed to give Optimizely users a fair and accurate representation of the risk of implementing a variation that is no different than the baseline, taking into account what we have learned about the way users have used and want to use the platform. In particular, users of Optimizely want to have the freedom to look at their A/B test results in real time, and to test multiple variations and goals simultaneously.

 

In order to be confident that encouraging users to run many A/B tests would not expose them to hidden sources of risk, we moved from reporting chance to beat baseline to a more global measure of significance, which we call ‘statistical significance.’ For an example showing how these hidden risks might materialize in very apparent errors, check out our blog post on Stats Engine.

 

In statistical terms, we now impose false discovery rate control on the winners and losers found by your A/B tests. In practical terms, controlling the false discovery rate at 10% means that at most 10% of your winning variations are not actually winning, and losing variations are not actually losing. This is the same thing as saying that all of your winning and losing variations have statistical significance above 90%.
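For anyone curious about the underlying idea, the classic Benjamini-Hochberg step-up procedure is the textbook way to control the false discovery rate across many simultaneous tests. The sketch below only illustrates the concept; it is not Stats Engine's actual implementation, which is described in the blog post linked above.

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.10):
    """Classic Benjamini-Hochberg step-up procedure -- shown only to
    illustrate FDR control; not Stats Engine's actual implementation."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = fdr * np.arange(1, m + 1) / m        # k/m * fdr for each rank k
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.nonzero(below)[0].max()           # largest rank passing its threshold
        significant[order[:cutoff + 1]] = True
    return significant

# Example: five goal/variation combinations tested in one experiment
print(benjamini_hochberg([0.001, 0.02, 0.04, 0.30, 0.80]))
# -> [ True  True  True False False]
```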

 

Calculating statistical significance for any one test now factors in all the other hypotheses (goals and variations) you are testing in the same experiment. Due to this global dependence, it makes a lot of sense to switch to a test which reports significance as evidence that there is a difference between variation and baseline, regardless of whether the variation is a winner or loser. If you don’t, you quickly run into technical issues, and under some interpretations of the problem, false discovery rate control no longer works!

 

We understand that running a two-tailed test at 90% significance (two 1-tailed tests at 95% significance) may feel too conservative for some users, and this was one motivating reason why the significance threshold for calling winners and losers can now be changed in your project-level settings. Informing customers of the statistical risk they are exposed to, and presenting it in a clearly interpretable fashion, is the best foundation for exposing that threshold for adjustment.

 

With Stats Engine we have not only created a more powerful statistical tool, but also one that is easier to use and allows our customers to make business decisions from their data with the confidence that their exposure to statistical risk is fully represented.

 

Leo Pekelis

Statistician at Optimizely

 

techsquare 02-08-18
 

Re: Let's talk about Single Tailed vs Double Tailed


Excellent article!! It's just great.

My True Quotes