What to do when Experiments are inconclusive?
Sometimes when we roll out experiments and let them sit for a couple of weeks, we'll have collected data, but the statistical significance sits pretty low, between 1% and 30%. When we leave them running longer and longer, there doesn't seem to be any change to the significance, while the visitor count goes up and the conversion rates stay pretty level. At what point can we make a business decision even if the significance doesn't break 90%?
Usually at times like this, I've noticed that the data is pretty neck-and-neck, with one variation taking a slight lead over the other. Even if there is a +1% improvement, does that really mean anything if the significance is so low? Usually, when the experiments we run have about the same conversion rate (and the significance is low), I'd say any edge of one over the other is purely chance, and neither the control nor the variation influences users any more than the other.
So in these situations, how do you make that decision of picking the control or the variation? My gut tells me that if conversions are flat, you can go with whichever one aligns with your business model. The problem is, if I'm reporting data to my boss, I'm sure he/she would be curious as to why I picked the outcome that had the "lesser" performance, even if there was only an inconclusive +1% difference. I know that a +1% improvement can make an impact down the road, but won't a very low statistical significance indicate the improvement was most likely chance?
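For intuition on why a +1% edge at low significance is probably noise, here's a minimal sketch using a standard two-proportion z-test. This is a textbook approximation, not Optimizely's actual Stats Engine, and the traffic numbers below are made up for illustration:

```python
import math

def two_proportion_significance(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for the difference between two conversion rates.
    Returns the 'statistical significance' (1 - p-value) as a percentage."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both variations convert equally
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via erf; two-sided p-value
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return (1 - p_value) * 100

# Hypothetical neck-and-neck test: 10.0% vs. 10.1% conversion, 5,000 visitors each
sig = two_proportion_significance(500, 5000, 505, 5000)
print(f"Significance: {sig:.1f}%")
```

With numbers like these, the significance lands well below 90%, which is the test's way of saying a ~1% relative edge on this sample size is entirely consistent with chance.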
Marketing Automation & Optimization
We have done a decent amount of internal education around the fact that not every test is a home run, and not every test will even have conclusive results after a large number of conversions. Those are the ones we put back on the shelf to possibly try later.
Opportunity costs are a key consideration to avoid losing momentum in your testing program. You are right to evaluate inconclusive results as toss-ups; our stats engine cannot determine with statistical significance that one variation has a relative conversion advantage over any other. In these situations, you can make your call on which variation to implement based solely on absolute performance (i.e., even though I cannot conclude that it is better compared to other variations, I'm comfortable moving forward with implementation based on the conversion rate achieved in this specific tested sample).
Even if you choose the variation with the lower conversion rate, you are still acting on data. You are making the assumption that the higher conversion rate of the other variation is due to chance, and that there is no reliable statistical data to prove the other variation will continue to outperform the one you select.
It's fine to allow tests to run long while you wait for results that directly influence your organization's primary success metric. However, don't test in the dark: use the projected sample size remaining figure, presented alongside statistical significance on the results page, to decide whether a test is worth waiting on. Always ask yourself the question, "What else could I be testing that would more directly influence the metrics that matter most to me?"
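To put "sample size remaining" in perspective, here is a rough estimate of how many visitors per variation it takes to detect a given relative lift. This uses the classic normal-approximation formula, not Optimizely's internal projection, and the significance level and power are assumptions (alpha = 0.05 two-sided, power = 0.80):

```python
import math

def required_sample_per_variation(base_rate, relative_lift):
    """Approximate visitors needed per variation to detect a relative lift
    between two conversion rates. Assumes alpha = 0.05 (two-sided) and
    power = 0.80; the corresponding z-values are hardcoded below."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = 1.96  # two-sided, alpha = 0.05
    z_beta = 0.84   # power = 0.80
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    return math.ceil(n)

# Detecting a +1% relative lift on a 10% base rate takes over a million
# visitors per variation; a +20% lift needs only a few thousand.
tiny_lift = required_sample_per_variation(0.10, 0.01)
big_lift = required_sample_per_variation(0.10, 0.20)
print(tiny_lift, big_lift)
```

This is the arithmetic behind the opportunity-cost argument: waiting on a test to confirm a tiny lift can cost you traffic that several bolder experiments could have used.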
Check out my #OneMinuteMonday tip on the topic of opportunity cost.
Director, Experience Optimization | BVAccel
1) Really, anything that is within +/- 5% variance after 2-3 weeks' worth of testing begins to feel inconclusive to us. We usually set a minimum visitor count per variation and, once an inconclusive test has hit that number, we'll start a 7-day clock. In the first day or two, we're looking for ANY variation in performance -- if it continues to stay absolutely neck-and-neck, we'll usually pull the plug after day 2 or 3 of the 7-day clock (keep in mind: the experiment we are discussing has been running for quite some time at this point, upwards of 2 weeks at least).
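The stopping rule described above could be sketched roughly like this. All thresholds here are this poster's examples, not product defaults, and the helper itself is hypothetical:

```python
def should_conclude_test(days_on_clock, variance_pct, min_visitors_hit):
    """Toy sketch of the stopping rule: once the minimum visitor count is
    hit, a 7-day clock starts; if performance stays within +/- 5% variance
    (neck-and-neck) for the first couple of days, pull the plug early."""
    if not min_visitors_hit:
        return False  # keep collecting data; the clock hasn't started
    if abs(variance_pct) <= 5 and days_on_clock >= 2:
        return True   # still neck-and-neck after the clock started: conclude it
    return days_on_clock >= 7  # otherwise, let the full 7-day clock run out
```

For example, a test sitting at +1% variance on day 3 of the clock would be concluded, while one showing a +12% swing would get the full seven days to prove itself.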
2) The opportunity cost that KHATTAAB mentioned is REALLY IMPORTANT; this is your momentum! We've found that a string of inconclusive tests can really sap the general "rah! rah!" testing excitement that we've worked hard to instill within our clients' offices. If something isn't working, meaning the statistical significance stays flat and the variance doesn't move at all, we tend to get really critical of those experiments and look to conclude them as efficiently as possible.
A parting note on momentum: it's like a boulder. Once your organization (or your client's) buys into the idea of continuous experimentation and testing, it rolls downhill on its own. On the flip side, getting that boulder up the hill is tough.
When we see very little improvement between two variations, rather than deciding which to implement, we'll usually take a step back and evaluate the test itself. We'll make some adjustments to the designs of the variations and try to come up with something more dramatic, so that we'll hopefully see clear performance differences between the two.
I'd be interested to know more about the designs you're testing. If you're frequently running tests that show very little difference between the variations, then maybe consider making bigger or more dramatic changes relative to your control.
Optimizely Platform Certified