by
January 20, 2015 - last edited on March 4, 2015

Hi, I’m Leo, Optimizely’s in-house statistician. I’m just finishing my PhD in Statistics at Stanford, and along with several of my colleagues at Stanford and Optimizely, I worked on developing the new Stats Engine. Darwish Gani, our product manager for Stats Engine, will also be answering questions in this "Ask me anything" discussion.

We’re here all week to answer any questions you have about Optimizely’s new Stats Engine, statistics for A/B testing in general, or anything else that comes up as you get familiar with our new statistical framework. To ask a question, simply reply to this discussion.

by
January 20, 2015
"Because of the new controls introduced for testing multiple goals and variations, we have switched our test to a two-tailed interpretation"

Can you elaborate on this decision? Why not continue using the old approach of two 1-tailed tests at >95% and < 5%. If the two are mathematically equivalent, is this a material distinction or meant to address the criticism Optimizely has taken for using a 1-tailed approach?

Level 1
by
January 20, 2015

Will the new results method also apply to:

• Archived tests?
• Paused tests?
• Currently running tests?

In other words, if I open a paused test next week, will I be looking at results from the old method or the new method?

Level 2
by
January 20, 2015 - edited January 22, 2015

Hi Mike,

Yes, this is a material distinction. In order to implement controls for testing many goals and variations, we moved to a 2-tailed test. While you are exactly correct that 1-tailed and 2-tailed tests are mathematically equivalent for a single comparison, it has not been shown that they remain equivalent when accounting for many A/B tests at once. With Stats Engine you can make claims like 'at most 10% of your winning and losing variations are actually no different from baseline.' This isn't always the same thing as '5% of your winners are not really winners and 5% of your losers are not really losers.'

Optimizely
by
January 20, 2015

Greg,

These changes only affect experiments started on 1/21/2015 or later; they will not be applied to experiments retroactively. Currently running tests will also not be affected by Stats Engine (because their start date is before 1/21/2015).

Cheers,

Darwish

Optimizely
by
January 20, 2015

How does this affect tests that use revenue per visitor as a goal? I am skeptical of "chance to beat baseline" for revenue since, as I understand it, traditional t-tests assume a normal distribution and are unreliable when the actual distribution of revenue is skewed. For instance, if I'm selling a subscription product where there are only a few distinct packages (e.g., $50/month, $100/month, $200/month), and most people choose the lowest priced package and few choose the highest priced one, will the new stats engine be more reliable?

Mike

Level 1
by
January 20, 2015

A lot of the A/B testing I do is on ecommerce websites. We prefer to use Revenue Per Visitor as our main goal KPI. Can you discuss how the new Stats Engine will work with a Revenue Per Visitor goal?

Thanks!

Level 1
by
January 20, 2015

Hi Mike,

Great question. You are absolutely right in saying that t-tests are less reliable when the actual distribution is skewed, due to the normality assumptions they make.

Short answer: we now account for this in revenue calculations!

As maybe one more level of detail on your example: if we compute average revenue among customers who choose between 3 price packages (or buy nothing, i.e. $0), the more customers we average over, the more possible values the average can take. With only 1 customer, average revenue can take 1 of just 4 unique values. With 2 customers, it can be any one of 8 values (I counted). And so on.

This increasing number of possibilities with increasing number of visitors is why even averages of discrete distributions look more and more normal with bigger sample sizes. The issue is that this convergence can take a while, and an uneven, skewed distribution makes it even slower. So while a t-test will be reliable eventually, that can take a long time for skewed revenue distributions!
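
As a rough illustration (a simulated example, not Optimizely data; the package prices and purchase probabilities below are made up), you can watch the skewness of the average shrink only slowly as the sample grows:

```python
import random
import statistics

random.seed(1)

# Hypothetical revenue distribution: most visitors buy nothing,
# a few buy the $50 package, fewer the $100, fewest the $200.
PACKAGES = [0, 50, 100, 200]
WEIGHTS = [0.90, 0.07, 0.02, 0.01]

def avg_revenue(n_visitors):
    """Average revenue over one simulated sample of n visitors."""
    draws = random.choices(PACKAGES, weights=WEIGHTS, k=n_visitors)
    return statistics.mean(draws)

def sample_skewness(values):
    """Standardized third moment of a list of sample averages."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

# Skewness of the distribution of the average shrinks as n grows,
# but only slowly -- this is why t-tests on skewed revenue data
# need many visitors before the normal approximation is trustworthy.
for n in (10, 100, 1000):
    averages = [avg_revenue(n) for _ in range(2000)]
    print(n, round(sample_skewness(averages), 2))
```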

One of the changes we made with Stats Engine is we now compute skew corrected test statistics for revenue (or any other goal that can potentially take on many values). Significance values are adjusted by a correction factor which estimates skewness from your currently running experiment. Not only does this make results reliable with considerably fewer visitors, but it also results in a more powerful test (when looking at historical revenue tests, the number of conclusive results jumped by a factor of 1.5).

Another neat feature of skew corrections is the resulting confidence intervals are no longer symmetric, but naturally adapt in the direction of the skew.
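
The exact correction factor Stats Engine uses isn't spelled out in this thread. As an outside-the-product sketch of the same idea, a percentile bootstrap of the difference in mean revenue is one standard skew-robust approach, and it shares the property just mentioned: its confidence interval is not forced to be symmetric. (All data below is made up.)

```python
import random
import statistics

random.seed(7)

def bootstrap_ci(baseline, variation, n_boot=2000, level=0.90):
    """Percentile-bootstrap CI for the difference in mean revenue.

    Unlike a t-interval, the endpoints are read off the empirical
    distribution of resampled differences, so the interval can be
    asymmetric in the direction of the skew.
    """
    diffs = []
    for _ in range(n_boot):
        b = random.choices(baseline, k=len(baseline))
        v = random.choices(variation, k=len(variation))
        diffs.append(statistics.mean(v) - statistics.mean(b))
    diffs.sort()
    lo = diffs[int(((1 - level) / 2) * n_boot)]
    hi = diffs[int((1 - (1 - level) / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical skewed revenue samples: mostly $0, a few package prices.
baseline = [0] * 900 + [50] * 70 + [100] * 20 + [200] * 10
variation = [0] * 880 + [50] * 80 + [100] * 25 + [200] * 15
print(bootstrap_ci(baseline, variation))
```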

I’m in the process of writing a separate technical paper on skew corrections for revenue tests and will be happy to ping you once it’s posted.

- Leo

Optimizely
by
January 20, 2015

Hi Jeremy,

Stats Engine works as intended on revenue per visitor goals. You can look at your results at any time you want and get an accurate assessment of your error rates on winners and losers, as well as confidence intervals on the average revenue per visitor.

In fact, your estimates should be more reliable, sooner, because we now correct for the inherent skewness in calculations based on revenue. For more details on this, see my answer to @mikefiorilloCRO above.

- Leo

Optimizely
by
January 21, 2015 - edited January 21, 2015

Hi Stat. Team,

First, my compliments to the team for their effort in addressing the 'dance of the p-values' problem. I very much appreciate this response from Optimizely, providing us users with more reliable/usable stats.

I have 4 questions:

1) "The number of goals tested doesn't influence the reliability of the results" - the stats engine takes this into account, I read somewhere, if I am correct.

Does the number of goals influence the amount of sample data needed to calculate reliability?

2) Sample size: in the product update FAQ it is said that you don't need the sample size calculator any more.

But taking into consideration that:

a) you need to "plan your traffic allocation before starting" (A Practical Guide to Statistics for Online Experiments, page 13)

b) you have to take into account your business cycle (weekdays/weekend days)

it still seems wise to me to make an estimate, when possible, to set a proper value for traffic allocation. Am I correct?

In the document "Stats Engine: How Optimizely calculates results to enable business decisions" I read that it is useful to check with the new sample size calculator.

3) The 'Best of 3 test approach' (A Practical Guide to Statistics for Online Experiments, page 17, expert tip): why would this tactic work better than one test that runs for 3 weeks? That is, normally you run the experiment for 1 week, then repeat it 2 more times (= 3 weeks total), vs. one test left running for 3 weeks?

4) Does Stats Engine take into account the Traffic Allocation setting? I can imagine that a high level of traffic allocation, which indicates the sample size relative to the population, tells something about the reliability of the estimate, and the significance level too(?).

Jeroen.

Level 2
by
January 21, 2015

Hi Leo, I have a few questions:

1. How can we estimate testing time in advance with the new system? (to help with planning)

2. Does the new system actively encourage regular checking to see if you have a winner?

3. Is there a minimum sample size before a winner should be called? (apart from the fact that ideally you'd have an even representation of weekly business cycles etc...) Or does the new system avoid calling winners really early?

4. I usually judge success or failure only by revenue per visitor, but also track other goals to get indications of the bigger picture. As I understand it, your new system will raise the bar for statistical significance as each new goal is added, to minimise the risk of testers setting loads of goals and just getting lucky. Is there a way of indicating that my additional goals shouldn't be included in the success/failure calculations and therefore won't raise the bar for statistical significance?

Thanks,

Dave
@DaveAnalyst

Level 1
by
January 21, 2015

Hi Jeroen,

Thanks for the compliment! We tried to think a lot about how to best tailor the stats to the way Optimizely’s users use and will use the product.

1)

The answer is sometimes. The way false discovery rate control works, adding more goals (and variations) that have little to no signal between variation and baseline will make it harder to detect signal on the ones that do. Basically, adding more noise into the overall group of goals and variations makes it harder to detect signal, and insofar as you're adding more overall noise, you'll need somewhat more sample data to get reliable results.

This is intentional, and we think pretty important, because as you add more goals and variations that are insignificant, you raise the chance that you'll see falsely significant results, or false winners and losers. Our measure of significance now accounts for the fact that you'll be searching among multiple goals and variations to pick out the most significant effects. You don't have to spend time or excess effort worrying about the impact on your error rates on your own, or, even worse, be unpleasantly surprised when the results you find end up not panning out.
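
Stats Engine's sequential procedure isn't reproduced here, but the classic Benjamini-Hochberg step-up rule is a simple fixed-horizon illustration of how false discovery rate control raises the bar as more goals and variations are tested:

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Classic BH step-up rule: return indices of p-values declared
    significant while controlling the false discovery rate at `fdr`.
    """
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    m = len(p_values)
    cutoff = -1
    for rank, idx in enumerate(order, start=1):
        # Each p-value is compared against a threshold that shrinks
        # as more hypotheses (goals x variations) are tested.
        if p_values[idx] <= fdr * rank / m:
            cutoff = rank
    return sorted(order[:cutoff]) if cutoff > 0 else []

# One strong goal alone is a discovery...
print(benjamini_hochberg([0.04]))                       # [0]
# ...but the same p-value among many noisy goals may no longer be.
print(benjamini_hochberg([0.04, 0.5, 0.6, 0.7, 0.8]))   # []
```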

We understand that this may cause a little concern, especially when there is a very specific goal or variation that is important to your business. We don't want you to be afraid of A/B testing, and A/B testing a lot. So what we did is separate out your primary goal, and your total revenue goal, from all other goals. This means that the primary goal and total revenue goal aren't impacted by your other (secondary) goals, and your secondary goals are not impacted by your primary goal or total revenue goal.

The way I would suggest using this feature is to test goals on variations that are most important for your business as primary goals.

Finally, from looking at a lot of historical experiments, we found that the impact of 'more sample data from multiple goals and variations' was pretty slight. There's a good chance you won't even feel the impact.

2)

Yes! While we have now shifted to a platform where you are not forced to use a sample size calculator, planning out your A/B testing strategy can still be very beneficial.

We’ve replaced the sample size calculator with an estimate of the average number of samples it will take to get a significant result. It still works the same way. You put in estimates of your baseline conversion rate and what effects you are looking to detect and you’ll get a good estimate of how many visitors you can expect to need for significance. (here's the link: https://www.optimizely.com/resources/sample-size-calculator )

3)

Rerunning a test can give you an estimate of how robust your test results are to temporal or seasonal variation. While we did put in a feature to detect an underlying shift in the effect size of your test (for example weekday vs weekend effects), this detection will never be better than you explicitly running a test Monday to Sunday if a week-long estimate is precisely what you’re looking for. In this case, 3 week-long A/B Tests are a better estimate of a 'week effect' than one 3 week test.

4)

Stats Engine handles unequal traffic to variation and baseline without a problem. It calculates significance taking into account any imbalances in visitor counts.

One problem that you can run into, which is what I think you're getting at here, is if you decide to change traffic allocation based on the results of your A/B test. For example, putting more traffic in the variation as you see it's getting closer to a winner. This sort of dynamic traffic allocation is trickier to deal with and is the subject of a statistical procedure called bandits ( http://en.wikipedia.org/wiki/Multi-armed_bandit ). The good news is there are a lot of connections between bandits and sequential testing. This is an area we are very excited to start looking into in the near future!

Also, Shana is the mastermind behind “A Practical Guide …”. I let her know about your questions, and she may be around to elaborate on some of my responses in that realm.

Best,

Leo

Optimizely
by
January 21, 2015

Hi Dave,

1)

We’ve replaced the sample size calculator with an estimate of the average number of samples it will take to get a significant result. It still works the same way. You put in estimates of your baseline conversion rate and what effects you are looking to detect and you’ll get a good estimate of how many visitors you can expect to need for significance. (here's the link: https://www.optimizely.com/resources/sample-size-calculator )

2)

We are working on implementing more ways to actively encourage regular checking. For example, soon you can expect an option which sends you an email when a variation reaches significance.

3)

We do have a minimum of 100 visitors before we show any results. This is more for robustness of the platform than anything else.

The new system is designed exactly to avoid calling winners really early, so that if you do see a winner with few visitors, you know it has a high chance of really being a winner. Put another way, seeing 90% significance means you have a 10% chance of the variation not really being a winner, whether you have 1,000 visitors or 1 million.

You point out very accurately that taking into account business cycles and other temporal variation will still have an impact on your results. So waiting can be especially useful to get a more accurate representation of the amount of lift generated by a winning variation. We now display visual confidence intervals on improvement which show a range of values that will contain the true lift with 90% confidence (where the 90% number is also linked to your project level significance threshold). This confidence interval will become more narrow as you get more visitors and more information about your test.

4)

Yes there is.

What we did is separate out your primary goal, and your total revenue (revenue per visitor) goal, from all other goals. This means that the primary goal and total revenue goal aren't impacted by your other (secondary) goals, and your secondary goals are not impacted by your primary goal or total revenue goal.

The way I would suggest using this feature is to test goals on variations that are most important for your business as primary goals.

Hope this helps clarify things. - Leo

Optimizely
by
January 22, 2015

2) "... you are not forced to use a sample size calculator ..."

>> What about the case the new Stats Engine is perfectly suited for: you don't know the exact CTR or amount of visitors?

Which things should I take into consideration when I adjust the traffic allocation during a test, to get reliable results within a certain business cycle (business cycle = e.g. a week, or a purchase cycle of 3 days)? A problem is that I skew the data(?).

4) Traffic allocation...

What I was referring to was not the multi-armed bandit direction, but: besides the fact that the stats engine needs a certain amount of data per variation, a certain CTR, and a significance level, is there information in the knowledge that you test on 'only' 5% of your traffic vs. 100% of your traffic? (What do you consider the population? And is it point estimates...)

New question:

5) The False Discovery Rate Control:

"Reporting a false discovery rate of 10% means that 'at most 10% of winners and losers have no difference between variation and baseline,' which is exactly the chance of making an incorrect business decision." (source: Statistics for the Internet Age)

I then think of it as a measure where the significance level and power are both present (type 1 and type 2 error).

Is this true? Or is it 'only' looking at the type 1 error? ("... no difference between variation and baseline ...")

+

Is there a control or readout where we can see the value of the false discovery rate? (Should there be one?)

Kind regards, Jeroen

Level 2
by
January 22, 2015

Leo,

Based on the below line, I am concerned that with tests where MDE is lower than 5 percentage points we will have to wait longer for results with Stats Engine than Fixed Horizon. Is that accurate? In practice most of our tests result in less than 5% wins for the KPI.

"We found that if the lift of your A/B test ends up 5 percentage points (relative) higher than your MDE, Stats Engine will run as fast as Fixed Horizon statistics. As soon as the improvement exceeds the MDE by as much as 7.5 percentage points, Stats Engine is closer to 75% faster. For larger experiments (>50,000 visitors), the gains are even higher, and Stats Engine can call a winner or loser up to 2.5 times as fast."

Level 1
by
January 22, 2015

I love that Optimizely has recognized that it is human nature to peek at results and has adapted. This is a step in the right direction for sure. I was wondering how many tests you have run to prove that this is indeed better? I was also wondering if you have ever thought about adding a no-tamper button to tests, so that once the test is set up and running it is almost impossible to tamper with for a set amount of time or until a number of visitors is reached. (Obviously there would need to be some override ability, but that could require special access.) This would be useful in organizations where there are a lot of Optimizely users.

Level 2
by
January 22, 2015

Hi @nschlegel ,

Sorry if the wording isn't quite clear on that! The 5% refers to the difference between the MDE that you predicted for your A/B test and the effect size that you end up finding.

An issue is that these gains and losses depend on a number of parameters - your baseline conversion rate, whether the variation is winning or losing, your threshold for significance, and your MDE, among other things. We tried to post as few numbers as possible while still addressing a 'typical' Optimizely customer.

Here's one example that I hope may clarify things in your situation. Say you have a baseline conversion rate of 10%, you expect an MDE of 3% relative improvement, you want to call winners and losers at a 90% significance level, and you'll be happy calling a Fixed Horizon test at 80% power (where the old sample size calculator was tuned). Put together, this would have asked you to wait for about 125,000 visitors before looking at your results.

Now suppose that instead of a 3% relative improvement, your A/B test ends up having a 4% relative improvement. With Stats Engine, you will call this test a winner at about 96,500 visitors, needing roughly 23% fewer visitors.
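
The 125,000 figure on the fixed-horizon side can be sanity-checked with the textbook two-sided, two-proportion sample size formula (a sketch only; Optimizely's internal calculation may differ in rounding and details):

```python
from math import erf, sqrt

def z_quantile(p, lo=-10.0, hi=10.0):
    """Standard normal quantile via bisection on the erf-based CDF."""
    cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    for _ in range(100):
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def fixed_horizon_n(p1, rel_mde, alpha=0.10, power=0.80):
    """Visitors per variation for a two-sided two-proportion z-test."""
    p2 = p1 * (1 + rel_mde)
    za = z_quantile(1 - alpha / 2)   # two-sided significance threshold
    zb = z_quantile(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (za + zb) ** 2 * var / (p1 - p2) ** 2

# 10% baseline, 3% relative MDE, 90% significance, 80% power
print(round(fixed_horizon_n(0.10, 0.03)))  # roughly 125,000 per variation
```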

Feel free to play around with our new out-of-product sample size calculator and compare to the Fixed Horizon formula to get a feel for the differences. (There are quite a few good fixed-horizon calculators out there; I usually search for 'sample size calculator two sample test of proportions'.)

- Leo

Optimizely
by
January 22, 2015

Hey Jeroen,

It sounds like you're getting into the area of experimentation strategy and planning. This is a great thing to think about, and most successful approaches to A/B testing have an element of it, but the difficulty is that there are fewer right answers, because what you do really depends on your particular situation.

2)

One thing that I would recommend keeping in mind is that a quick result on an A/B Test may change as time goes on. This isn’t necessarily because the test itself made a mistake, but because we are testing for the average difference between variation and baseline over time. So while a variation could look really good the first day it’s introduced, the effect averaged over a week is smaller because the variation doesn’t do as well on day 2, day 3, and so forth.

A useful way to use Stats Engine for you might be to not reduce allocation to zero on variations right away, but to keep a small percentage of traffic going to them for a bit. Then you can check back in the next business cycle to see how your results have changed.

4)

You’re right, this is a bit different than the bandit question. Usually we consider population to be all the visitors to your site. Since traffic allocation is random, whether you test on 5% of your traffic or 100%, you’ll eventually get a representative sample of your entire population. The issue with using only 5% of your traffic is that you may miss subpopulations of your overall traffic when testing for shorter periods of time. A benefit may be that your A/B Test will be active for a longer period of time, so you average over things like ‘day of the week’ effects.

5)

False discovery rate control is like flipping type I error on its head. Instead of asking what the chance is that I'm making an error if there really is no difference between variation and baseline, I ask what the chance is that there is a difference, given the results of my A/B test so far. I'm controlling my error rate when calling winners and losers.

Power, on the other hand, is really a measure of how good an A/B Test is at detecting results. If there is a difference between my variation and baseline, what’s the chance I’ll detect it? The way Stats Engine works is the chance of detecting any difference increases with more visitors, so you gain more power by waiting longer.

So false discovery rate is a measurement of making errors, while power is a measurement of detection strength, and while similar, they are not exactly complementary.

The "Significance" that is now reported on A/B tests is a readout of false discovery rate control. A way to read it: if a winning A/B test shows 90% significance, it has about a 10% chance of not really being a winner. And if you call winners and losers once they reach 90% significance, you can expect about 10% of these winners and losers not to pan out when you implement them.

Good and pointed questions!

- Leo

Optimizely
by
January 22, 2015

Hi Ben,

Glad you like it! We are committed to pushing the envelope in making powerful statistics approachable and actionable.

How many tests to run depends on how you want to prove that it's better. One exercise you could do is run 5 A/A tests for 5,000 visitors each. Using a t-test and peeking after every visitor would very likely show at least 2 conclusive results. With Stats Engine you are about as likely to see 0 conclusive results.
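
A rough sketch of that A/A exercise, with simulated conversions in place of live traffic and a two-proportion z-test standing in for the classic fixed-horizon t-test (all parameters below are made up):

```python
import random
from math import sqrt

random.seed(42)

def peeking_false_positive(n_visitors=2000, p=0.10, z_crit=1.96):
    """Run one A/A test, peeking at a two-proportion z-test after
    every visitor; return True if it ever (falsely) hits significance.
    """
    ca = cb = na = nb = 0
    for i in range(n_visitors):
        if i % 2 == 0:
            na += 1
            ca += random.random() < p
        else:
            nb += 1
            cb += random.random() < p
        if min(na, nb) < 50:
            continue  # small-sample guard before peeking begins
        pa, pb = ca / na, cb / nb
        pooled = (ca + cb) / (na + nb)
        se = sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))
        if se > 0 and abs(pa - pb) / se > z_crit:
            return True
    return False

trials = 200
rate = sum(peeking_false_positive() for _ in range(trials)) / trials
# Nominal error is 5%, but peeking after every visitor inflates it a lot.
print(f"false positive rate with continuous peeking: {rate:.0%}")
```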

Account-level permission settings aren't something that we've talked much about recently, but I definitely see how that would be useful. I'll mention it around here and see what people think.

- Leo

Optimizely
by
January 22, 2015

@ben - you can also add this to the Product Ideas board so it is reviewed by our Product Team. It's a great idea.

Optimizely
by
January 23, 2015 - edited January 23, 2015

Thanks again Leo,

To get my picture right on the matter, I put things in a table, based on a table of Evan Miller:

Can I write the false discovery rate like this? (Should I change the 'chance happening' into 'numbers'?)

Jeroen.

Level 2
by
January 23, 2015
Hey, I saw that the default statistical significance level for calling tests has been reduced to 90% (in the Settings). Why? Also, how come customers were not notified about it? I discovered it by chance.
Level 4
by
January 23, 2015

That’s a cool table. You’ve almost got it right. Power (also 1 - Type II error) and Significance (also type I error) are both correct.

False discovery rate is the proportion of calls you make (green check boxes) that are not correct, so in your notation,

false discovery rate = C / (A + C)

It’s a nice table because you can really see the differences. With Type I and Type II error, you marginalize over the same symbols in the “Reality” column. With false discovery rate, you marginalize over the same symbols in the “Test says” column. Could be useful to use in our education materials!

There’s a less colorful version of your table on wikipedia ( http://en.wikipedia.org/wiki/False_discovery_rate ).
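
In code, the three quantities differ only in which cells of the table you divide by (using the A/B/C/D labels from the discussion, with made-up counts):

```python
# Hypothetical counts over many experiments, in the table's notation:
# A = real effects correctly called, B = real effects missed,
# C = true nulls falsely called, D = true nulls correctly not called.
A, B, C, D = 80, 20, 10, 890

false_discovery_rate = C / (A + C)   # errors among the calls you made
type_1_error_rate = C / (C + D)      # errors among the true nulls
power = A / (A + B)                  # detected share of real effects

print(false_discovery_rate, type_1_error_rate, power)
```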

Best,

Leo

Optimizely
by
January 23, 2015 - edited January 23, 2015

Hi @lkraav

Apologies if the messaging wasn't clear enough. With the new Stats Engine, we have transitioned to running a 2-tailed test at 90% significance, which is mathematically equivalent to running two 1-tailed tests at 95% significance. The reason is that controlling the false discovery rate, which fully informs you of the risks of testing many goals and variations at once, has a much more user-friendly interpretation in the 2-tailed case.

We now allow you to adjust your significance level threshold for calling winners and losers in your project level settings, so you are free to set this level to correspond to the rate of error that fits best to the business needs within your organization.

• a much more detailed answer can be found here
• and our general blog post describing all our changes is here
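
That equivalence is easy to check numerically: a two-sided p-value is twice the smaller of the two one-sided p-values, so rejecting at 10% two-sided is exactly the event that one of the two one-sided tests rejects at 5%. A small sketch:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def decisions(z):
    """Compare a 2-tailed 90% test against two 1-tailed 95% tests
    for an observed z-statistic."""
    p_upper = 1 - normal_cdf(z)   # one-tailed: variation beats baseline
    p_lower = normal_cdf(z)       # one-tailed: baseline beats variation
    two_tailed = 2 * min(p_upper, p_lower) < 0.10
    one_tailed_pair = p_upper < 0.05 or p_lower < 0.05
    return two_tailed, one_tailed_pair

# The two decision rules agree for any z.
for z in (-2.5, -1.0, 0.0, 1.5, 1.7, 2.5):
    assert decisions(z)[0] == decisions(z)[1]
```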

- Leo

Optimizely
by
January 26, 2015

Suppose very early on in a test we need to get rid of a variation and delete it entirely. Does the stats engine remember that it used to be there and still set a higher statistical bar, or does it now act as though it never existed?

Level 1
by
January 26, 2015

Hi Dave,

Currently, deleting a variation entirely will remove it from multiple testing considerations. In your words, 'acting as though it never existed.' This was done primarily so that users would not be unfairly penalized for making changes early on in their experiments.

On the other hand, pausing a variation will still keep that variation’s results in consideration. In particular, the variation’s results at the time of the pause will be used for all future calculations.

My general advice is the following.

If your experiment has had enough visitors that you could reasonably have learned from any of the variations, pause a variation instead of deleting it. The rationale is that any learning or interpretation of results from variations should be accounted for in future analysis of the experiment.

Of course, this is a judgement call, and one can imagine cases where a change to a variation, or a removal of a variation is due to external factors and has nothing to do with the results of your experiment thus far.

The danger of going down this route is that results across experiments become less interpretable, because it is difficult to keep such subjective judgements consistent. We are working on products that incorporate adaptive pruning and expanding of variations in a principled and more objective way.

Best,

Leo

Optimizely
by
January 26, 2015

Hello,

This concludes my scheduled time to conduct this AMA. I really enjoyed speaking with all of you, and answering all these fantastic questions!

We plan to have many more offerings of educational content, training, and opportunities to directly interact with us on all things stats. Stay tuned.

In the meantime, if you have any further Stats Engine related questions, feel free to post in the general community on Optiverse. We will also be adding answers to all questions from this AMA to our general Stats Engine FAQ.

Best,

Leo

Statistician at Optimizely

Optimizely