04-13-15

# Testing with low traffic, probability of no difference or a loss


Hey,

Running in a bit of a circle here with my thoughts. I'm testing on a site with extremely low traffic. It often takes 2-3 months to conclude a single test.

So let's say I have a treatment where variation 1 has 43 purchases (1541 visitors) and the original has 35 purchases (1478 visitors). From here I cannot conclude that variation 1 is performing better, as the test hasn't reached statistical significance.

But now the question is: how high is the probability that there's no difference, or that variation 1 will eventually perform worse?

How can I calculate this?

Sincerely,

Marie

--
Leho, marketing & tech architect | G+: lkooglizmus@gmail.com

Leo 04-13-15

## Re: Testing with low traffic, probability of no difference or a loss

Hi @lkraav,

You’re correct that you can’t conclude statistical significance based on the conversion rate difference and visitor count you show. As far as what sorts of calculations you can now do, and what information to take from an inconclusive test, there are a few things to consider:

1)

First, the quantity that you want to calculate, “the probability that there is no difference, given the conversion rates seen so far,” is not the same thing as a traditional p-value. The p-value represents the chance that you would have observed your purchase rates if there really were no difference between variation 1 and the original. The certain, conditioned, or unchanging part of that statement is “no difference,” and the uncertain part, on which you make a probability statement, is the “observed conversion rates.” In a p-value, it’s taken as fact that your variation is no different from the original.

This is also why you can’t conclude “no difference between variation and original” from a traditional A/B test. If the size of the improvement relative to visitor counts isn’t large enough, you can only conclude that you didn’t see enough evidence to disprove “no difference.”
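For reference, here is a sketch of the traditional fixed-horizon calculation on the numbers from this thread, using the classical pooled two-proportion z-test. This is illustrative only and is not what Stats Engine computes:

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z_test(x1, n1, x2, n2):
    """Classical pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                        # pooled rate under "no difference"
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # standard error of the difference
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided p-value
    return z, p_value

# 43/1541 purchases on variation 1 vs 35/1478 on the original.
z, p = two_prop_z_test(43, 1541, 35, 1478)
print(f"z = {z:.2f}, p = {p:.2f}")  # p is far above the usual 0.05 cutoff
```

With these numbers the p-value is well above any conventional threshold, matching the inconclusive result described above, and it still only answers “how likely are these results, assuming no difference,” not “how likely is no difference.”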

2)

Second, the statistical significance we now display is a conservative estimate of exactly the quantity you want to estimate. In fact, this was a driving reason behind our switch to calling winners based on false discovery rates instead of false positive rates. The way to interpret statistical significance is in the following sentence: “The chance that variation 1 will eventually perform significantly differently from original (be declared a winner or loser) is at least [STATISTICAL SIGNIFICANCE].”

If you want to learn how to calculate this quantity for yourself, our technical write-up on Stats Engine is a good place to start, especially section 3.

3)

Third, and finally, I’d suggest confidence intervals as another way to get useful information from an inconclusive A/B test. A confidence interval on the difference in conversion rates is defined as a range of differences which is likely to contain the eventual difference between your variation and original. The confidence intervals (difference intervals) we show are linked to the significance level in your project-level settings, so if you set the level at 90%, you’ll see a range of differences which is 90% likely to contain the eventual difference. You can treat the upper endpoint, lower endpoint, and midpoint of the interval as high, low, and middle-ground estimates of the eventual difference.

While you can calculate a confidence interval based on traditional statistics, it will not be the same as the difference interval on Optimizely’s results page, because our intervals account for the two issues we address with Stats Engine: looking at your results more than once, and testing multiple goals and variations.
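As a point of comparison, a traditional fixed-horizon confidence interval for the difference in conversion rates can be sketched as below. Again, this is the textbook calculation, not Optimizely's corrected difference interval:

```python
from math import sqrt

def diff_ci(x1, n1, x2, n2, z=1.645):
    """Traditional CI for the difference in proportions (z=1.645 gives a 90% interval)."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled standard error
    d = p1 - p2
    return d - z * se, d + z * se

# Variation 1 minus original, using the thread's numbers.
lo, hi = diff_ci(43, 1541, 35, 1478)
print(f"90% CI for the difference: [{lo:.4f}, {hi:.4f}]")
```

Here the interval straddles zero, which is another way of seeing that the test is inconclusive: both a loss and a meaningful win are still plausible eventual outcomes.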

I hope this explanation was helpful. We also discuss stopping vs continuing A/B Tests and what information you can take away in a recent webinar: https://community.optimizely.com/t5/Presentations/Webinar-recording-Stats-Engine-Q-amp-A-webinar-rec...

Best,

Leonid Pekelis
Statistician at Optimizely
Optimizely
lkraav 05-06-15

## Re: Testing with low traffic, probability of no difference or a loss

Hi Leonid,

Thanks for the thorough reply. Very much appreciated.

A couple of clarifying questions remain on points 2 and 3. I hope you have a chance to reply.

1.

Got it. Hypothesis 0 is that there's no difference either way, and Hypothesis 1, which we want to prove, is that there is a difference (either positive or negative).

2.

With the new Stats Engine you switched from one-tailed to two-tailed tests, lowering the statistical significance threshold to 90%. Can you compare that sentence to what it would have been before the new Stats Engine was introduced?

My guess: "The chance that variation 1 will eventually perform significantly better than the original is at least [STATISTICAL SIGNIFICANCE].”

Hence, before the new Stats Engine we were accepting a 5% chance (95% confidence, one-tailed) that the test result is a false positive; now we are accepting a 5% chance (90% confidence, two-tailed) that we will detect a false positive and a 5% chance that we will detect a false negative (difference, as you said it). Right?

So is the testing with the new stats engine faster as well? In which cases?

3.

Can you clarify what you mean by Optimizely’s confidence intervals being different because the “new Stats Engine looks at results more than once and tests multiple goals and variations”?

Marie

Conversion analyst @ ConversionXL

marie 05-27-15

## Re: Testing with low traffic, probability of no difference or a loss

Any thoughts on this?
Leo 05-28-15

## Re: Testing with low traffic, probability of no difference or a loss

Hello @marie

Apologies for not answering this sooner. We must not have seen it the first time around. Thank you for being persistent!

2.

Not quite. While we did change from a one-tailed test to a two-tailed test with the new Stats Engine, this does not by itself change the interpretation of statistical significance. Two one-sided tests at 95% significance are mathematically equivalent to a single two-sided test at 90% significance. You can read more about this here: https://community.optimizely.com/t5/Strategy-Culture/Let-s-talk-about-Single-Tailed-vs-Double-Tailed...
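That equivalence can be sanity-checked directly: the critical value for a one-sided test at 95% confidence is the same as for a two-sided test at 90% confidence, because both leave 5% in the tail being tested.

```python
from statistics import NormalDist

z_one_sided_95 = NormalDist().inv_cdf(0.95)          # one-sided, alpha = 0.05
z_two_sided_90 = NormalDist().inv_cdf(1 - 0.10 / 2)  # two-sided, alpha = 0.10, split across tails
print(round(z_one_sided_95, 3), round(z_two_sided_90, 3))  # both 1.645
```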

The change made with Stats Engine is even more fundamental than that.

The sentence for statistical significance before Stats Engine was: “The chance that variation 1 would see the observed improvement (43 and 35 conversions out of 1541 and 1478 visitors)

or less, if there really was no difference, is [STATISTICAL SIGNIFICANCE]”

The previously reported false positive rate only told you how likely it would be for random fluctuation to generate your test results. In other words, what is the chance of these test results, assuming no difference.

With the switch to false discovery rates, our statistical significance number now estimates the chance of no difference, after seeing your test results. This is exactly the answer to your question, “how high is the probability that there is no difference?”

The chance of a false negative is a bit of a different quantity altogether. A false negative occurs when you fail to reject the null hypothesis on a test that does actually have a difference between variation and baseline. While we do not give estimates of the false negative rate in our product, you can assume that if you run your tests roughly in accordance with our sample size calculator ( https://www.optimizely.com/resources/sample-size-calculator/ ), then you’ll have a low false negative rate.
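To make the low-traffic problem concrete, here is the textbook fixed-horizon sample-size approximation for comparing two proportions. This is not Optimizely's sequential calculator, but it gives the same order-of-magnitude intuition about why small lifts on a ~2.4% baseline take months at low traffic:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_base, rel_lift, alpha=0.10, power=0.80):
    """Visitors needed per arm to detect a relative lift (two-sided, textbook formula)."""
    nd = NormalDist()
    p_var = p_base * (1 + rel_lift)              # conversion rate if the lift is real
    z_a = nd.inv_cdf(1 - alpha / 2)              # significance threshold
    z_b = nd.inv_cdf(power)                      # desired power
    var = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_a + z_b) ** 2 * var / (p_base - p_var) ** 2)

# Detecting a 20% relative lift on a ~2.4% baseline conversion rate.
n = sample_size_per_arm(0.024, 0.20)
print(f"~{n} visitors per variation")
```

At a few hundred visitors a week per arm, a sample of that size is exactly the 2–3 month timeline described at the top of this thread.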

3.

As a person running an A/B Test I could take two actions to invalidate my test results: looking at my statistical significance more than once (monitoring my test over time), and testing many goals and variations at once.

Instead of telling our customers they could not take these very reasonable actions, we designed Stats Engine to allow our customers freedom to look at results as many times as they want, and test multiple goals and variations, while still giving valid, statistical results.

Our blog post on Stats Engine ( https://blog.optimizely.com/2015/01/20/statistics-for-the-internet-age-the-story-behind-optimizelys-... ) goes over this in more detail.

Best,

Leonid Pekelis
Statistician at Optimizely
Optimizely
marie 05-28-15

## Re: Testing with low traffic, probability of no difference or a loss

Hi @Leo,

Thanks for the follow-up. I really appreciate it.

2. Got it. Thanks for the clarification.

3. Regarding testing many goals and variations at once I would like to bring out this sentence from the blogpost: “As you add goals and variations to your experiment, Optimizely will correct more for false discoveries and become more conservative in calling a winner or loser.”

Does this mean that if we add multiple goals to a treatment, even when not all of them are relevant to the business decision, we’re essentially adding extra goals for Stats Engine to consider when it calculates statistical significance, and therefore wasting time? We add page views and other micro-goals to every test to cross-check sales funnel numbers, but the sentence from the blog post makes me think we should stop this immediately.

Also, does this explain the immediate change in statistical significance when a couple of variations were deleted, as asked here: https://community.optimizely.com/t5/Using-Optimizely/Deleting-variations-from-a-A-B-C-D-E-changes-st...

Sincerely,
Leo 05-29-15

## Re: Testing with low traffic, probability of no difference or a loss

Hi @marie

The short answer is yes. If time to significance is a limiting factor for you, I would suggest looking at the goals you are adding to your experiments and deciding if they are all necessary for you to make business decisions.

There is one important caveat here. The goal you mark as a ‘primary’ goal and your ‘total revenue’ goal are not impacted by the number of other goals you add. The intended use case is to have a single goal that is most important for your business decisions, which you’ll mark as the ‘primary’ one, and have any number of secondary goals.

Stats Engine correctly accounts for your inclusion of multiple secondary goals, so that it does not find false positives simply because there are many goals and variations to check. The benefit is that you can test as many secondary goals as you want without having to worry about an increased rate of false positives.
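The intuition behind this correction can be illustrated with the classical Benjamini-Hochberg false-discovery-rate procedure. Stats Engine uses a sequential variant rather than this exact method, but the key behavior is the same: the more goals you test, the stricter the threshold each one must beat.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return indices of hypotheses rejected under the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ranks, smallest p first
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / m:                # threshold shrinks with m
            k = rank
    return sorted(order[:k])

# The same marginal goal (p = 0.03) is called alone, but not among five goals:
print(benjamini_hochberg([0.03]))                          # rejected
print(benjamini_hochberg([0.03, 0.30, 0.45, 0.62, 0.80]))  # not rejected
```

This is why adding micro-goals you don't act on can slow down calls on the goals you do care about, and why the primary and total revenue goals are exempted from the correction.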

As for your last question, yes, if you delete variations from your experiment, then the multiple testing impact to remaining variations will be lessened and you may see an immediate jump in significance.

While this may increase significance, we do not advise doing it often, as your chance of finding a false positive would increase as well. The reason is that you chose to keep a particular variation precisely because it was close to significance, and removed the others because they were far from it. It would not be correct to calculate significance for the lucky variation that was kept as if it had been in a pool of fewer variations, since what you really did was pick it out of a larger pool of variations.

If you are attending Opticon, I will be giving a talk titled “Statistics in 40 Minutes” that will explain this idea more. If not, I’ll be happy to share the slides with you if you ping me afterwards.

I also posted a slightly extended version of my answer to your point 3 in the “deleting variations” thread you reference above.

Best,

Leonid Pekelis
Statistician at Optimizely
Optimizely
marie 06-01-15

## Re: Testing with low traffic, probability of no difference or a loss

Hi @Leo,

Thanks for putting your time and effort into the responses. All logical.

And great to hear that primary and total revenue goals are not influenced by the number of secondary goals.

I'll have a look at the deleting variations response for more insight!

Great discussion.

Sincerely,

Marie Polli
Conversion Optimizer at ConversionXL Agency