
Significance, Sample Sizes, and Stopping Tests

darwishg 03-04-15


 

Below are thoughts and answers to common questions we have seen around Optiverse, and directly from customers, about seeing 0% significance, choosing sample sizes, and deciding when to stop tests.

 

Here is a link to our Webinar where we also discuss a few of these points, and specifically talk about what to do when your test is inconclusive (go to 17:15).

 

 

Why is sample size important?

 

A healthy sample size is at the heart of making accurate statistical conclusions, and it is a strong motivation behind why we created Stats Engine. When your test has a low conversion rate for a given sample size, it means that there is not yet enough evidence to conclude that the effect you're seeing is due to a real difference between the baseline and variation rather than chance alone - in more statistical terms, your test is underpowered.


The table below estimates the sample size you would need to reliably detect different levels of Improvement (relative difference in conversion rates) across a few different baseline conversion rates, based on Optimizely's sample size calculator/Stats Engine. It takes far fewer visitors to detect large differences in conversion rates--just look across any row. And for any given improvement, the smaller your baseline conversion rate, the more visitors you need to detect a difference--that's what you see down each column.

 

 

   

| Baseline Conversion Rate | 5% Improvement | 10% Improvement | 25% Improvement |
|--------------------------|----------------|-----------------|-----------------|
| 1%                       | 458,900        | 101,600         | 13,000          |
| 5%                       | 69,500         | 15,000          | 1,800           |
| 10%                      | 29,200         | 6,200           | 700             |
| 25%                      | 8,100          | 1,700           | 200             |

*The table above shows rounded numbers from our sample size calculator.
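
If you like poking at these numbers yourself, here is a rough Python sketch using the classical fixed-horizon two-proportion power formula (two-sided 5% significance level, 80% power). This is intuition only - it is not how Stats Engine or our sample size calculator compute sample sizes, so the figures will not match the table exactly.

```python
# Rough fixed-horizon sample-size estimate (per variation) for a conversion
# rate test, using the classical two-proportion power formula. Illustrative
# only - Stats Engine and Optimizely's calculator use different math.
from scipy.stats import norm

def approx_sample_size(baseline_rate, relative_improvement, alpha=0.05, power=0.80):
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_improvement)
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # critical value for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

# Example: 10% baseline conversion rate, looking for a 10% relative improvement.
print(round(approx_sample_size(0.10, 0.10)))
```

Try different baselines and improvements and you will see the same pattern as the table: larger lifts and larger baselines need far fewer visitors.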

 

Stats Engine allows you to evaluate results as they come in and avoid making decisions on tests with low, underpowered sample sizes (a "weak conclusion"), without having to commit to a predetermined sample size before running a test. The reason you want to avoid making decisions on underpowered tests is that any improvement you see is unlikely to hold up when you implement your variation, potentially causing you to spend valuable resources and realize no benefit. (Read more about the importance of sample size by us and 3rd parties.)

 

 

Why does my test show 0% significance even though I have a "big" improvement?

 

Believe it or not, it’s probably because the number of visitors in the test is still quite small relative to the improvement you have seen thus far.

 

Referencing the table above, if you have a conversion goal with a 10% baseline conversion rate and are seeing a 10% lift, you need 6,200 visitors to make a strong conclusion with 90% Statistical Significance. A test with 2,000 visitors is not even one-third of the way to having enough data to declare significance. Stats Engine does not make any significance declaration when the sample size is still small, since there is not yet much evidence that the variation is better than the control. As the number of visitors increases to 40% or even 50% of the total visitors needed to reach a significant conclusion, you will see Statistical Significance begin to increase.

 

As explained in our behavior of Statistical Significance article, significance increases from two types of evidence: larger conversion rate differences and stable conversion rate differences over more visitors. Eventually, you will have enough evidence - likely as a combination of both of these - to increase Statistical Significance.
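
To get a feel for what accumulating evidence looks like, here is a toy sketch using a classical (non-sequential) two-proportion z-test - again, not Stats Engine's actual calculation - where the same observed 10% lift becomes stronger and stronger evidence as it holds up over more visitors.

```python
# Toy illustration: the same observed lift becomes stronger evidence as it
# persists over more visitors. Uses a classical two-proportion z-test, not
# Stats Engine's sequential computation.
from math import sqrt
from scipy.stats import norm

def two_sided_p_value(n_per_arm, baseline_rate, variation_rate):
    pooled = (baseline_rate + variation_rate) / 2
    se = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = (variation_rate - baseline_rate) / se
    return 2 * (1 - norm.cdf(abs(z)))

# A stable 10% lift on a 10% baseline, observed at growing sample sizes:
for n in (500, 2000, 6000, 15000):
    print(f"{n:>6} visitors per variation -> p = {two_sided_p_value(n, 0.10, 0.11):.3f}")
```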



Example:

 

[Screenshot: experiment results page for this example]

 

 

For Variation #1, a 20% lift on a 30% baseline requires ~500 visitors. This test, 155 visitors in, is still relatively far from the 505 visitors needed to declare a winner.
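
As a quick back-of-the-envelope progress check using the figures quoted above:

```python
# Rough progress check for the example above, using the numbers quoted in the text.
visitors_so_far = 155
visitors_needed = 505   # calculator estimate for a 20% lift on a 30% baseline
print(f"About {visitors_so_far / visitors_needed:.0%} of the visitors needed so far")
```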

 

 

When should I stop running a test?

 

 

When to stop a winning or losing test

Your test is a winner (or loser) when it shows statistical significance greater than your desired significance level. Pat yourself on the back for getting a conclusive result.

 

When to stop an inconclusive test

[Screenshot: experiment results page for this example]

 

In the example above, this test needs ~300 more visitors to reach 90% statistical significance. At this point, the experimenter should simply decide whether she can afford to wait for 300 more visitors. If she really wanted to, she could use the sample size calculator to play out a scenario of how many visitors would be needed if the improvement were to drop to 17% (500 more visitors).
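
If you want to play out that kind of scenario in code, you can reuse the rough approx_sample_size sketch from earlier. The exact figures will differ from the calculator's, but the shape of the tradeoff is the same; the 30% baseline is taken from the example above.

```python
# "What if the lift shrinks?" - how the required sample grows as the assumed
# improvement drops. Uses the approx_sample_size() sketch from earlier, so the
# numbers are rough approximations rather than the calculator's exact output.
for lift in (0.20, 0.17, 0.15):
    needed = approx_sample_size(0.30, lift)   # 30% baseline, as in the example
    print(f"{lift:.0%} improvement -> roughly {round(needed):,} visitors per variation")
```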


[Screenshot: sample size calculator scenario]

 

 

 

 

Whoa! Good info. So what do I do if I don't have a lot of traffic?

 

Use the difference interval!

 

[Screenshot: results page showing the difference interval]

 

 

 

Let's say it is the end of the quarter and you can't run the test any longer, but you love this test. How can you use the data to back up your hypothesis? The difference interval is a confidence interval on the difference in conversion rates. In other words, it shows you the range of values that likely contains the absolute difference in conversion rates you would see in a long-run implementation of your variation over the baseline. It contains valuable information collected by your test so far!

In this case, there is a 90% chance that the absolute difference in conversion rates lies between -10.93% and 29.18% if you were to implement Variation #1. That means there is a better chance that this test has a positive effect than a negative effect. Of course, it’s still risky to implement this variation. But if your time is limited and you need to make a decision one way or the other, it might be a risk you’re willing to take.

 

A useful and easy risk analysis that you can do with a difference interval is to report best case, worst case, and middle ground estimates of predicted lift by reading the upper endpoint, lower endpoint, and center of the difference interval, respectively.
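
If you want to see roughly where a difference interval comes from, a classical confidence interval on the difference of two proportions gives the flavor. Stats Engine's difference interval is computed differently (it is designed to stay valid while you monitor results continuously), and the visitor and conversion counts below are made up for illustration.

```python
# Classical (Wald) 90% confidence interval on the absolute difference in
# conversion rates. Stats Engine's difference interval is computed differently,
# so treat this as intuition only.
from math import sqrt
from scipy.stats import norm

def difference_interval(conversions_a, visitors_a, conversions_b, visitors_b, confidence=0.90):
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b)
    z = norm.ppf(1 - (1 - confidence) / 2)
    return diff - z * se, diff + z * se

# Hypothetical counts (not the ones from the screenshot):
worst, best = difference_interval(24, 78, 31, 77)
middle = (worst + best) / 2
print(f"worst case {worst:+.1%}, middle ground {middle:+.1%}, best case {best:+.1%}")
```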

 

As the test runs longer and gathers more data, the difference interval will shrink. If this variation is actually a winner, the difference interval will eventually exclude the possibility of any negative values, at the same time that Statistical Significance increases to 90%.

 

Testing Bigger

 

Understanding the tradeoff between improvement and sample sizes should not only tell you when to stop or keep running a test, but it should also inform your testing strategy. If your site doesn't have a lot of traffic, you might not have the luxury of chasing down 5% lifts on 5% baseline conversion rates.

 

We make this tradeoff at Optimizely on our own A/B tests, too. You can choose to test more hypotheses and chase bigger lifts, or test fewer ideas and go after smaller ones. There is a balance for every business, and as you test more you will find yours.

 

If you're looking for a place to start, you might ask a question like: would you rather run 20 tests a quarter and shoot for a 12% lift each time, or run 5 tests a quarter and chase a 5% lift in each of them?
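
To make that question concrete, you can compare the rough quarterly traffic budget of each strategy with the approx_sample_size sketch from earlier - assuming, for illustration, a 5% baseline conversion rate, which is not part of the question itself.

```python
# Comparing the two strategies' rough traffic budgets with approx_sample_size().
# The 5% baseline conversion rate is an assumption for illustration.
per_test_12 = approx_sample_size(0.05, 0.12)   # chasing 12% lifts
per_test_05 = approx_sample_size(0.05, 0.05)   # chasing 5% lifts
print(f"20 tests x ~{round(per_test_12):,} visitors each = ~{round(20 * per_test_12):,} per quarter")
print(f" 5 tests x ~{round(per_test_05):,} visitors each = ~{round(5 * per_test_05):,} per quarter")
```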

 

Now play around with our sample size calculator, keeping your site's traffic in mind, to calibrate those numbers to your business.

 

 

We discussed some of these topics in our Webinar as well - check it out.

 

Best,

Darwish 

 

 

 

 

--
Product Manager at Optimizely

Re: Significance, Sample Sizes, and Stopping Tests

Hi Darwish,

Thanks a lot for your explanation.

I would like to understand better how to interpret the sample size calculator.

I am used to calculating sample size after specifying a power (usually 80%) and a significance level (usually 5%). This is what is done in classical statistics in order to control Type I and Type II errors.

Now, I understand that the new Stats Engine is not based on classical statistics. However, the fact that we are not required to specify a power when calculating a sample size is what confuses me.

My question is: is there a more quantitative interpretation of what the sample size calculator actually means? For example, if we wait for the number of visitors suggested by the sample size calculator (for a given baseline and MDE), are we expected to find an effect at least 50% of the time within this period?

Thanks!
Claudio