Volatile winner results in daily data
We have an integration with Omniture and export our Optimizely testing data from Omniture into Excel on a daily basis. Doing this, we often see quite volatile results: there is a big uplift in conversions for the variation during the first days, and a similar uplift for the original a few days later. To make it even worse, the winner often changes fundamentally every other day. What does this behavior mean for significance, and how can we cope with it when explaining the results?
Hi Olga, how much traffic and how many conversions has each variation received so far? What do the confidence level and difference interval look like in Optimizely? It may just be that the test needs to run longer so more data can accumulate.
It's common in our tests to see big swings in the data in the first few days after a test launches. We usually try to let things run for a couple of weeks so that each variation receives a significant number of conversions before we declare a winner. Obviously it varies depending on traffic and activity.
Hi Olga! I think it will help if I answer your questions in reverse order.
First: "...how we can cope with it when explaining the results." This trend of "volatile results early on that stabilize over time" is well described by the law of large numbers, which states that the average of the results obtained from a large number of trials should be close to the expected value, and will tend to get closer as the number of trials increases.
I find visualizations help. Here's the outcome of running a simulation of a fair coin toss, where we know each side (heads, tails) has an equal probability (0.5) of being flipped. For the sake of example, let's pretend our conversion goal is to land a head with each flip. We simulate 1,000 coin flips and trend the cumulative conversion rate (defined as # Heads Flipped / # Total Flips). The black line shows the cumulative conversion rate after each trial, while the red line shows the expected conversion rate. As you can see, there is volatility early on (in this simulation, the first 10 coin flips produced 8 heads!), which stabilizes over time.
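If you want to reproduce something like that chart yourself, here's a minimal Python sketch of the same idea (my own code, not the simulation behind the figure): flip a fair coin 1,000 times and track the running "conversion rate".

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

n_flips = 1000
heads = 0
cumulative_rates = []
for i in range(1, n_flips + 1):
    heads += random.random() < 0.5  # fair coin: P(heads) = 0.5
    cumulative_rates.append(heads / i)

# Early estimates swing widely; later estimates hug the true rate of 0.5.
print(f"after   10 flips: {cumulative_rates[9]:.3f}")
print(f"after  100 flips: {cumulative_rates[99]:.3f}")
print(f"after 1000 flips: {cumulative_rates[999]:.3f}")
```

Plot `cumulative_rates` against the flip index and you get exactly the shape described above: wild swings at the start, converging toward 0.5.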
Be mindful of how low volume and a non-representative sample can cause volatile results early on. This leads into your question, "What does this course of the test mean for significance?" As a best practice, you should set parameters for your experiment's stopping rule. Your stopping rule determines at what point you stop an experiment: you either accept a winner or determine it is unlikely you will discover a winner with a meaningful-enough magnitude of lift if you continue the experiment (meaning there is no value in further exploration).

Fortunately, Optimizely's Stats Engine handles the burden of accounting for repeated significance testing (i.e., constantly evaluating the data to determine "significance") as well as the multiple comparison problem (i.e., multiple goals and variations), which is a big help – not every tool on the market does this for you!

You still need to plan around your business and customer lifecycle to ensure your test population (who is being exposed to your experiment) is a representative sample of your site's visitors. By that I mean, whoever would be exposed to a potential winning variation should be represented in your test (i.e., run on weekends and weekdays, at all times of day, holiday vs. non-holiday, across traffic sources [SEO, PPC, email, etc.]).
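To make the stopping-rule planning concrete, here's a back-of-the-envelope sample-size calculation for a classical fixed-horizon two-proportion test. This is my own sketch, not Optimizely's math – Stats Engine uses sequential testing, so it doesn't require a fixed sample size – but the calculation is still handy for setting expectations about how long a test might need to run.

```python
from math import sqrt
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, min_rel_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variation for a fixed-horizon
    two-proportion z-test. Illustrative only; sequential engines differ."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_rel_lift)  # smallest lift worth detecting
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_b = NormalDist().inv_cdf(power)          # desired statistical power
    p_bar = (p1 + p2) / 2
    n = ((z_a * sqrt(2 * p_bar * (1 - p_bar))
          + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    return int(n) + 1

# e.g. 5% baseline conversion rate, hoping to detect a 20% relative lift
print(sample_size_per_variation(0.05, 0.20))  # roughly 8,000+ visitors per variation
```

At 60 conversions per variation per day on a ~5% conversion rate, that order of magnitude (thousands of visitors per variation) is why a few days of data simply can't settle the question.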
Hope this helps!
Do you think that our problem is due to looking at the results on a daily basis?
I picked one test where we had about 60 conversions per variation daily.
I attached a graph of what the conversion rates look like in this test. Even though we have a clear winner in the overall results, their credibility suffers when we look at the daily conversion rates. The question is: how can the variation be a winner when it was a clear loser during days xy? What is the best practice here? Is the solution not to look at the results until we have reached the sample size we calculated beforehand?
Thank you a lot,
Thank you a lot for this detailed answer! :-)
"as a best practice, you should set parameters for your experiment's stopping rule. Your stopping rule should determine at what point you stop an experiment, meaning you either accept a winner or determine it is unlikely you will discover a winner with a meaningful-enough magnitude of lift if you continue this experiment (meaning there is no value in further exploration)."
--> When we set a stopping rule (let's say, 10,000 visitors), does that mean that the moment we reach this number is the first time we should share the results, because everything before it is not significant?
On this: "The question is: How can the variation be a winner when it was a clear loser during days xy?" – the simple answer is, it wasn't a clear loser on days xy. Best practice is to be mindful that day-to-day variance is normal and will stabilize over time – the more data you use when evaluating results, the less the uncertainty. You should still look at your experiment results daily, as poor results early on might indicate an issue with your experiment (i.e., flickering, increased page load time, UX errors [such as a button click not working], etc.) that would negatively impact UX and thus be detrimental to your customer experience and KPIs.
Simulations do wonders for explaining this "day-to-day variance that stabilizes over time" – in this example, I've simulated day-to-day results for just one variation and aggregated the daily results to show the cumulative (ongoing) conversion rate over time. I set the simulation to expose 1,000 visitors per day, for 21 days, with a 5% conversion rate. The green line shows the known conversion rate (5%), the blue line shows the conversion rate for just that day, and the red line shows the ongoing (cumulative) conversion rate through the experiment. As you can see, there are lots of ups and downs in the blue line, whereas the underlying trend in the red line shows that as time passes and more data is collected, the conversion rate of the simulation gravitates towards its expected value, even though there is plenty of day-to-day variance throughout the experiment. Conversion rate stabilizes over time.
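Here's a small Python sketch of that same simulation (again, my own code, not the one behind the figure): 1,000 visitors per day for 21 days at a known 5% conversion rate, printing both the daily and the cumulative rate.

```python
import random

random.seed(7)  # fixed seed so the run is reproducible

true_rate = 0.05        # the known conversion rate in the simulation
visitors_per_day = 1000
days = 21

total_visitors = total_conversions = 0
for day in range(1, days + 1):
    daily_conversions = sum(random.random() < true_rate
                            for _ in range(visitors_per_day))
    total_visitors += visitors_per_day
    total_conversions += daily_conversions
    daily_rate = daily_conversions / visitors_per_day       # the "blue line"
    cumulative_rate = total_conversions / total_visitors    # the "red line"
    print(f"day {day:2d}: daily {daily_rate:.3f}  cumulative {cumulative_rate:.3f}")
```

Run it a few times with different seeds: the daily rate bounces around, but the cumulative rate ends each run very close to 0.05.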
When making a decision based on an A/B test, your goal is to quantify the amount of uncertainty in the data you have collected. Early in an experiment, and on a day-to-day basis, there is a lot of uncertainty due to limited data – normal variance will seem to show an underlying trend (i.e., one version is a clear winner/loser), when in reality you simply don't have enough data to back up such claims. You should determine the amount of uncertainty you will tolerate based on the implications of making an incorrect decision (i.e., saying you have a winner when you don't) – in Optimizely, the difference interval tells you the range of plausible values where the difference between the original (baseline) and each variation actually lies, which helps you understand these implications for your business.
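A difference interval like that can be sketched with the standard normal approximation for the difference of two proportions. To be clear, this is a textbook approximation I'm writing for illustration, not Optimizely's exact sequential math, and the counts below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def difference_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation interval for the difference in conversion
    rates between two variations (a sketch, not Optimizely's exact math)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# hypothetical counts: 500/10,000 on the baseline vs 560/10,000 on the variation
low, high = difference_interval(500, 10_000, 560, 10_000)
print(f"95% difference interval: {low:+.4f} to {high:+.4f}")
```

Note that in this hypothetical example the interval still straddles zero, so despite the variation's higher average rate you could not yet rule out "no difference at all" – exactly the kind of nuance a raw daily conversion-rate chart hides.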
Hope this helps!
Hi Olga, in this example, what does the data look like when you compile it? Is there a clear winner? Also, what do the confidence level and difference interval in Optimizely look like?
My rule of thumb on evangelizing A/B testing is share early and often, but always include context. Far too often you see people comparing the difference between the average conversion rate of each variation – the reality is, you will never be able to determine the “true” conversion rate, only a range of plausible values that it could be. The more data you collect, the smaller the range of plausible values, and thus the less uncertainty you have. Think about a coin flip – you have a 50/50 shot of landing on either side in a fair coin. If you do 10 flips, you expect 5 per side. If you happen to flip 2 heads and 8 tails, you would probably say, “that likely happened just by chance”. If, however, you do 10,000 flips, where 2,000 land heads and 8,000 land tails, you are more likely to scream, “THIS ISN’T A FAIR COIN!!!”
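You can put rough numbers on that coin-flip intuition with a z-score against a fair coin. This is a normal-approximation sketch of my own (the approximation is rough at only 10 flips, but it still makes the point):

```python
from math import sqrt
from statistics import NormalDist

def fairness_z(heads, flips, p=0.5):
    """Z-score for observed heads vs a fair coin (normal approximation)."""
    return (heads - flips * p) / sqrt(flips * p * (1 - p))

for heads, flips in [(2, 10), (2000, 10000)]:
    z = fairness_z(heads, flips)
    p_val = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value
    print(f"{heads}/{flips} heads: z = {z:+.1f}, two-sided p ~ {p_val:.3g}")
```

Two heads in ten flips is under two standard errors from fair – plausibly chance. Two thousand heads in ten thousand flips is sixty standard errors from fair, with a p-value that is effectively zero: that coin is rigged.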
Phrasing results as, "Based on the data we've collected, we are X% confident that the lift observed (i.e., the result of comparing variations) is between Y% and Z%" will do wonders for explaining the value of an experiment – whether that's stopping and calling a "winner", letting the experiment continue collecting data in the hope that a variation becomes a "winner", or determining there isn't enough value remaining to continue (i.e., even if you continue the experiment and eventually find a winner, the potential lift is insignificant to your business, so you should move on to another test).
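If it's useful, here's a small helper of my own that turns raw counts into exactly that kind of statement. It's a sketch with two simplifications: a normal-approximation interval on the absolute difference, crudely converted to relative lift by treating the baseline rate as known; the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def describe_lift(base_conv, base_n, var_conv, var_n, confidence=0.95):
    """Format experiment counts as a hedged statement about relative lift.
    Sketch only: normal approximation, baseline rate treated as known."""
    p_a, p_b = base_conv / base_n, var_conv / var_n
    se = sqrt(p_a * (1 - p_a) / base_n + p_b * (1 - p_b) / var_n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lo = (p_b - p_a - z * se) / p_a * 100   # relative lift, in percent
    hi = (p_b - p_a + z * se) / p_a * 100
    return (f"Based on the data we've collected, we are {confidence:.0%} "
            f"confident that the lift observed is between {lo:+.1f}% and {hi:+.1f}%")

# hypothetical counts: 500/10,000 baseline vs 560/10,000 variation
print(describe_lift(500, 10_000, 560, 10_000))
```

Sharing numbers in this form, rather than as a bare "variation is up 12%!", builds exactly the context-first habit described above.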