Take Action on Results with Statistics

by Optimizely, September 23, 2015 (last edited October 12, 2015 by Optimizely)

Hello Optiverse,

 

I’m Leo Pekelis, a Statistician at Optimizely. I just hosted an online workshop called “Take Action on Results with Statistics” as part of our hands-on Optimizely Workshop series. Today, we covered:

  • Why Optimizely built Stats Engine
  • How to tune Stats Engine for your unique needs, starting with a single goal and variation
  • Choosing the optimal number of goals and variations for your experiment (preview)

 

First, why did Optimizely build Stats Engine?

 

In short, traditional statistics (the t-test) was effective 100 years ago, but isn't as effective in today's landscape. Back then, results were looked at once and only once, at a pre-determined sample size.

 

Today, A/B tests are much more complicated; there are multiple goals, constant iterations, and a desire to check results early and often. Unfortunately, some businesses still use t-tests when they A/B test, which comes with a couple of pitfalls:

 

Pitfall 1: Peeking

 

In traditional statistics, you are supposed to look at your results only once, at the pre-determined sample size. In practice, you likely want to check how your test is doing along the way, before it has reached statistical significance (p-value < 5%). This is called peeking. Why is peeking a problem? Because every time you peek, you increase the chance of a false positive: a result that shows the variation winning when in reality there is no difference, or even a loss.
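
To see why, here is a minimal simulation sketch (plain-vanilla z-tests, not Stats Engine's math): an A/A test in which both variations share the same true 5% conversion rate, checked after every batch of 500 visitors. There is never a real difference, yet peeking 20 times pushes the chance of at least one (false) significant result far above the nominal 5%.

```python
import numpy as np

rng = np.random.default_rng(0)

def ever_significant(n_peeks=20, batch=500, p=0.05):
    """Simulate one A/A test, peeking after every batch of visitors.

    Returns True if the test *ever* looks significant (|z| > 1.96),
    even though both variations share the same true conversion rate.
    """
    conv_a = conv_b = n = 0
    for _ in range(n_peeks):
        conv_a += rng.binomial(batch, p)   # conversions in the original
        conv_b += rng.binomial(batch, p)   # conversions in the variation
        n += batch                         # visitors per variation so far
        pooled = (conv_a + conv_b) / (2 * n)
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        if se > 0 and abs(conv_a - conv_b) / n / se > 1.96:
            return True
    return False

trials = 2000
rate = sum(ever_significant() for _ in range(trials)) / trials
print(f"False positive rate with 20 peeks: {rate:.1%}")  # well above 5%
```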

 

Pitfall 2: Mistaking “false positive rate” for “chance of wrong answer”

 

Even if you don’t peek, there’s still a chance of making the wrong call. This is because a t-test guarantees a 5% false positive rate across all of your goals and variations, including the inconclusive ones. So if you are testing multiple variations and goals, and many of them turn out to be inconclusive, there is a high likelihood that a sizeable share of your conclusive results are incorrect.
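
A quick back-of-envelope sketch makes this concrete. The numbers below are hypothetical, purely for illustration:

```python
# With a 5% per-test false positive rate, testing many goal/variation
# combinations where only a few have a real effect means a large share
# of the *conclusive* results are wrong. All numbers here are made up.
tests = 100          # total goal/variation combinations tested
real_effects = 10    # tests where the variation truly differs
alpha = 0.05         # per-test false positive rate of a t-test
power = 0.80         # chance a real effect is detected

false_positives = (tests - real_effects) * alpha   # 90 * 0.05 = 4.5
true_positives = real_effects * power              # 10 * 0.80 = 8.0
share_wrong = false_positives / (false_positives + true_positives)
print(f"~{share_wrong:.0%} of declared winners are false")  # ~36%
```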

 

Fortunately, with Optimizely you don’t have to worry about peeking, or about mistaking the false positive rate for the chance of a wrong answer. Stats Engine takes care of these common pitfalls for you, but by knowing some statistics of your own, you can tune Stats Engine to get the most performance for your unique needs.

 

Stats Engine and the tradeoffs of A/B testing

 

There are three tradeoffs of A/B testing that you should be particularly aware of when running a test:

  • Error Rates
  • Runtime
  • Improvement (Effect Size)/Baseline Conversion Rate

 

Let’s start with Error Rates.

 

In your Optimizely settings, you can set the statistical significance threshold of a test before you run it. For example, if you set your statistical significance to 85%, your error rate is 15%. In other words, when a winner or loser is declared, there is a 15% chance that the call is wrong: the declared winner is not actually a winner, or the declared loser is not actually a loser.

 

Now let’s talk runtime.

 

Runtime is the length of your experiment. The longer your test runs, the more visitors it collects, and the more likely it is to reach statistical significance. How many visitors a given length of time yields depends on your daily traffic, and is unique to your business.
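
To get a feel for the relationship, here is a rough sketch using the classic fixed-horizon two-proportion formula. This is not Stats Engine's sequential calculation, and the traffic numbers are made up:

```python
from scipy.stats import norm

def visitors_per_variation(baseline, lift, alpha=0.05, power=0.80):
    """Classic fixed-horizon visitors needed per variation (approximate)."""
    p1, p2 = baseline, baseline * (1 + lift)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2

# Example: 5% baseline conversion rate, hoping to detect a +20% lift,
# with 2,000 visitors/day split evenly between original and variation.
n = visitors_per_variation(baseline=0.05, lift=0.20)
days = 2 * n / 2000
print(f"~{n:,.0f} visitors per variation, roughly {days:.0f} days of runtime")
```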

 

Finally, there’s Improvement/Baseline Conversion Rate.

 

Improvement tells you the percentage by which your variation is winning or losing compared to the baseline. The baseline is usually the original, but the Results page lets you choose a different variation as the baseline.

 

What’s the tradeoff between the three? They are all inversely related!

 

For example, at any number of visitors, the looser your error-rate threshold, the smaller the effect sizes you can detect.

 

At any error rate threshold, stopping your test earlier means you can only detect larger effect sizes.

Finally, for any effect size, the lower the error rate you want, the longer you need to run your test.
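
The same rough fixed-horizon formula from the runtime sketch above shows all three relationships numerically (illustrative values only, not Stats Engine output):

```python
from scipy.stats import norm

baseline = 0.05                            # 5% baseline conversion rate
for alpha in (0.05, 0.15):                 # error rate threshold
    for lift in (0.05, 0.10, 0.20):        # relative effect size
        p2 = baseline * (1 + lift)
        z = norm.ppf(1 - alpha / 2) + norm.ppf(0.80)   # 80% power assumed
        n = z**2 * (baseline * (1 - baseline) + p2 * (1 - p2)) / (p2 - baseline) ** 2
        print(f"error rate {alpha:.0%}, effect {lift:+.0%}: ~{n:,.0f} visitors")
# Looser error rates and larger effects both need fewer visitors, i.e.
# shorter runtimes: the three quantities trade off against one another.
```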

 

Preview: How many goals and variations should I use?

 

Stats Engine is more conservative when there are more goals that have low signal, i.e. goals that aren’t strongly or directly affected by the changes you made in your variation. Adding a lot of “random” goals will slow down your experiment.

 

Here are a few tips to keep in mind with multiple goals and variations:

  • Ask: Which goal is most important to me?
    • Make that your primary goal (its results are not impacted by the other goals).
  • Run large A/B or multivariate tests without fear of finding spurious results, but be prepared for the cost of exploration: longer runtimes.
  • For maximum velocity, only test the goals and variations that you believe will have the highest impact.

 

That’s it! Have any questions? Feel free to chat with me here in Optiverse.

If you missed the workshop and want more than the highlights, the recording is here:

 

[Embedded workshop recording]

Resources

  • For the Workshop slides, see below
  • To learn more from our Optimizely Workshop series, sign up here.

Comments
by sgibson, September 24, 2015

Hi Leonid

 

Good class on statistics! I have a question about goals and statistics.

 

For example, we have several ways of getting in touch: phone, contact form, request call backs, etc.

 

If all goals are equal, could I aggregate them all together and calculate the statistical significance of the combined total? Would this be statistically valid, or would it increase the chance of seeing a false positive?

 

Many thanks

by Optimizely, September 24, 2015

Hi @sgibson,

 

Glad you liked the workshop!

 

Your question could be interpreted in a few different ways, so I will try to address all of the interpretations. 

 

1. You want to aggregate different measurements - phone, contact form, etc. - into one single measurement that answers the question, "does my variation increase the chance of getting in touch?" This is possible, for example by creating a single binary goal that is attained if a customer gets in touch through any of your channels. This then becomes a single A/B test, although with a goal that draws on data from possibly different sources.
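
As a sketch of what that aggregation looks like in data terms (the field names and layout below are hypothetical, not an Optimizely API):

```python
# Hypothetical per-visitor event log: collapse every "get in touch"
# channel into one binary outcome, yielding a single goal to test.
contact_events = {
    "visitor_1": ["phone"],
    "visitor_2": [],
    "visitor_3": ["contact_form", "callback_request"],
}

got_in_touch = {v: len(events) > 0 for v, events in contact_events.items()}
print(got_in_touch)
# {'visitor_1': True, 'visitor_2': False, 'visitor_3': True}
```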

 

2. You are measuring several different goals - increase in conversion rate by phone, contact form, request call backs, etc. - and currently not all of that data is part of an Optimizely experiment. In other words, you can measure contact form clicks within Optimizely, but phone call data is recorded in another data source.

 

3. You are measuring several different goals - increase in conversion rate by phone, contact form, request call backs, etc. - and all of that data is part of an Optimizely experiment.

 

If you intended interpretation 3, congratulations. Optimizely's Stats Engine automatically aggregates multiple goals in a single experiment to keep the chance of seeing a wrong conclusion below the threshold you set in your project level settings, e.g. 10%. So nothing to worry about here.

 

If you intended interpretation 1 or 2, the crux of the issue becomes getting your external data into an Optimizely experiment. Otherwise, performing a statistical calculation on multiple goals both inside an Optimizely experiment and outside it would raise your chance of seeing a false positive, as you correctly point out. We are working on solutions to help you get external data into Optimizely, and I would be happy to talk more about how we can make this work for your particular situation.

 

Best,

 

Leo
by sgibson, September 24, 2015

Hi Leo

 

Thanks for the comprehensive response.

 

Yes, I did intend interpretation 3, as our phone calls are being pushed into Optimizely, so we will be using Stats Engine.

 

In this case (from a statistical point of view), could I then add up my conversions across each contact point? E.g.:

 

                 Original   Variation
Visitors         10,000     10,000
Phone calls      50         70
Contact form     5          1
Test drive       5          1
Total leads      60         72

 

(Excuse the poor example)

 

Optimizely would only show whether each individual goal is statistically significant, rather than the total.

Is there a way I could calculate significance on the total?

 

Many thanks

Stephen

 
