How would i report on these inconclusive results?
I've been running an experiment for the past couple of weeks and need to wrap it up to make room for the next one. The significance has been sitting at around 37% for the past few days - and the longer the test runs, the higher the "~visitors remaining" number climbs. We'd probably have to let the test run for quite a while to get anything conclusive, and I'm not willing to spend time on that since we usually run tests for 2-3 weeks at a time.
I'm inclined to shut off the test and revisit the design, since the changes we made clearly weren't enough to move the needle very far. However, since there appears to be a slight upward trend (+27.3%), I'm also willing to kill the experiment and roll out the variation, since it clearly isn't hurting anything. With the significance so low, though, I'm not sure how I'd convince my boss that this is a good idea. In general, I'm afraid of handcuffing ourselves to older designs when we're making business decisions to move into refreshed branding. Any thoughts?
Marketing Automation & Optimization
According to other sources, this experiment is at 94% significance.
According to this source, you need only 1 more conversion to achieve significance:
I wanted to weigh in on your question, as it's a common one we get.
Overall, I'm inclined to agree with your sense that it might be best to move ahead with the variation demonstrating a directional improvement, as long as you explicitly communicate the following points:
* The result is not statistically significant, so the actual performance change you observe over time will be difficult to demonstrate empirically and will likely differ from the % improvement the variation shows now. There is a chance (though a small one) that it could end up flattening out to be close to the original.
* While we are not yet statistically sure that the leading variation is winning, we are, as you noted, fairly certain the result "won't hurt anything". I'd describe this decision-making process as 'non-negative logic': we're comfortable choosing a variation as the new status quo, even though it isn't a statistically significant winner, because we have very good reason to believe it isn't a loser. At this point, it's much more likely that the variation will show some sort of improvement than some sort of decrease.
* However, if you could wait a week or two more, you might be able to reach significance, which would give you the confidence to report a true demonstrated lift to your team, as well as make an iteration decision with confidence.
Is there any external factor that would prevent you from waiting a week or two more?
Hope this is helpful for you,
Just noticed the discussion, and for what it's worth, in our experience conversion counts as low as 66 and 83 mean close to nothing. A difference of 17 conversions could be just pure luck. The CR difference hasn't stayed stable during the whole testing period either.
I'd really recommend waiting another 2 weeks to reach a minimum of 120 conversions per variation.
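To make the "66 vs. 83 could be pure luck" point concrete, here's a quick two-proportion z-test sketch. The thread never gives the visitor totals, so the 2,000-per-variation figure below is purely hypothetical; swap in your real numbers.

```python
# Rough two-proportion z-test on the conversion counts mentioned above
# (66 vs. 83). The 2,000-visitor totals are HYPOTHETICAL - the thread
# doesn't state the real sample sizes.
from math import sqrt, erfc

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    p_value = erfc(abs(z) / sqrt(2))                  # two-sided p from the normal CDF
    return z, p_value

z, p = two_proportion_z_test(66, 2000, 83, 2000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p stays well above the usual 0.05 cutoff
```

With counts this low, the p-value is nowhere near 0.05, which matches the "close to nothing" intuition above.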
@Noah I myself was struggling with the same thought some time ago and initiated a discussion here talking about calling tests based on the probability that the variation won't lose. And that's essentially what Optie is doing with their new stats engine.
Check it out: https://community.optimizely.com/t5/Strategy-Cultu
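The "probability that the variation won't lose" idea can be sketched with Beta posteriors and Monte Carlo sampling. This is an illustration only, not Optimizely's actual stats engine, and the counts (66/2000 vs. 83/2000) are hypothetical.

```python
# Sketch of "chance the variation beats the original" using Beta(1,1)
# priors and Monte Carlo sampling. Counts are hypothetical; this is NOT
# Optimizely's stats engine, just the underlying Bayesian idea.
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=100_000, seed=42):
    """Estimate P(rate_B > rate_A) by sampling from each Beta posterior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / samples

p_win = prob_b_beats_a(66, 2000, 83, 2000)
print(f"P(variation beats original) ~ {p_win:.2f}")
```

A probability like this can be high (the variation is very unlikely to be a loser) even while classical significance is nowhere near 95% - which is exactly the 'non-negative logic' described earlier in the thread.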
Not sure on this, but I'm guessing the reason different tools show different significance rates is that they all use different algorithms for calculating statistical significance; with low traffic, controlling the false discovery rate is the way to go.
On this, “Since the significance is so low, I'm not sure how I'd convince my boss that this is a good idea” - it is important to recognize that you will never know the “true” conversion rate and lift, only a range of plausible values that it could be. The more data you collect, the smaller the range of plausible values, and thus the less uncertainty you have.
I would recommend thinking in terms of the amount of uncertainty you will tolerate when making a decision - this should be based on the implications of making a wrong decision (ie, saying you have a winner when you truly don’t), as well as the potential value remaining in an experiment (ie, stopping an experiment because the potential lift isn’t meaningful to your business OR continuing an experiment after you’ve found a “winner” because another variation might drive a larger lift). In Optimizely, the difference interval will tell you the range of plausible values of where the difference between original (baseline) and each variation actually lies, which helps you understand these implications on your business.
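The "range of plausible values" above can be sketched with a simple Wald confidence interval for the difference in rates. The visitor totals below are hypothetical (the thread only mentions 66 vs. 83 conversions), so treat this as an illustration of the concept, not Optimizely's actual difference-interval calculation.

```python
# Rough 95% "difference interval" for the lift: the range of plausible
# values for (variation rate - original rate). Sample sizes are
# HYPOTHETICAL; this Wald interval is a simplification of what any
# given tool actually computes.
from math import sqrt

def diff_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Wald 95% CI for the absolute difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

lo, hi = diff_interval(66, 2000, 83, 2000)
print(f"difference interval: [{lo:+.4f}, {hi:+.4f}]")
# The interval spans zero, so "no real difference" is still plausible.
```

As long as that interval straddles zero, you can't rule out a flat (or slightly negative) result - which is what "inconclusive" means in practice.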
“I'm inclined to shut off the test and revisit the design, since clearly the changes we made were not enough to push the needle very far.” I would say it’s too soon to make this call – early results are positive, and the potential value remaining would be meaningful to many businesses. If you look at this experiment’s conversion rate trend over time, both the Original and Variation 1 show an upward trend. There are many things that could be driving this effect – one of the more likely ones is that the business / conversion lifecycle is not instantaneous (ie, it takes a bit of time for someone to convert). When you launch an experiment, visitors can fall into one of three buckets:
* New visitors to your site just entering the conversion / buying lifecycle [early research]
* Visitors in the middle of the conversion lifecycle [close to converting]
* Visitors at the end of the conversion lifecycle [ready to convert!]
Early on, you will often find that the conversion rate appears lower: you’ve allocated visitors to a variation, but they haven’t yet gone through the full conversion lifecycle and thus haven’t converted. This effect diminishes over time, showing an increasing conversion rate trend as more and more of these visitors come back and convert.
My recommendation would be you continue the experiment another 1 – 2 weeks to ensure any potential effects have diminished and aren't skewing your results.
Thank you all for the replies! Some great food for thought here.
This may be a silly question:
Is there any way to conclusively call an experiment's results "flat"? It seems like there is this spectrum of Success--Inconclusive--Failure, but when would you say conclusively that the variation(s) just have no impact and return no projected improvement? Is "inconclusive" the same as "flat" if you've run the experiment long enough? Is there a difference?
I wanted to expand on @Hudson's excellent answer, and offer potential criteria for making this decision.
Any time you make a change, it results in a causal effect. Sometimes this effect is very big and sometimes it is very small – potentially so small that 1.) you need a ridiculous amount of data to detect it, or 2.) it’s not meaningful to you and your business. #2 is key and ties into your question of, “… when would you say conclusively that the variation(s) just have no impact and returns you no projected improvement." The data you have collected helps quantify the level of uncertainty you have when making a decision, outlines the possible outcomes, and tells you the likelihood of those outcomes occurring. This is incredibly useful when you need to determine whether it’s time to discontinue an experiment because you’ve collected enough data to conclude that no variation will drive a meaningful lift, OR continue an experiment after a “winner” has been discovered because it is still viable that another variation could produce a larger lift. A good way of phrasing this is, “We will continue this experiment as long as it is still viable to discover an X% lift from a variation.” What counts as a meaningful effect size (the X in that phrase) differs by business – a 0.5% lift might not mean much for a small business but could be massive for a company like Google. Figure out your X.
There are numerous ways to answer "how much value is left in my experiment?" – looking at the difference interval in Optimizely is a good place to start.
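One hypothetical way to operationalize "continue as long as an X% lift is still viable": check whether the upper edge of your difference interval still reaches the minimum relative lift you care about. All numbers below are made up for illustration.

```python
# Sketch of a "value remaining" stopping rule. If the upper bound of the
# difference interval can no longer reach your minimum meaningful lift,
# there's little value left in continuing. All inputs are HYPOTHETICAL.
def still_viable(ci_high, baseline_rate, min_relative_lift):
    """True if the interval still contains a lift >= min_relative_lift."""
    return ci_high >= baseline_rate * min_relative_lift

baseline = 0.033   # original's observed conversion rate (hypothetical)
ci_high = 0.0202   # upper bound of the difference interval (hypothetical)

# Is a 10% relative lift (your "X") still on the table?
print(still_viable(ci_high, baseline, 0.10))
```

If this returns False for your chosen X, you've collected enough data to call the result flat *for your purposes*, even without a significant winner or loser.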
Here's another tool you might find helpful:
If you were to let the experiment run longer, I would consider adding some additional metrics (aka goals) and segmenting the results as well.
For example, what else could a user be doing on that page INSTEAD of accomplishing the goal you've set up? Is there a cancel button? Can you track who is clicking on that?
Also, is there a particular segment that is performing better for either variation? Is that segment the core user persona you are targeting?
Thanks again everybody who responded to my original post – really appreciate all the insight. I am following up with a newer experiment that we’ve been running for 22 days now, and these results so far are in the same league.
I’m actually not terribly surprised by the results so far, given the nature of the landing page we’re testing. It's a pretty high-traffic landing page with the trendy long-page approach: heavy scrolling, material design, high-level branding. All of the most important information (and the form) sits at the top of the page – “above the fold” – and once you start scrolling, you get into the less important info. I wanted to run an experiment to see what happens to conversion rates if we cut out all of that info below the fold; I was more interested in the effect of removing all that content than in testing new below-the-fold content. From the data over the past three weeks, there is no indication at all that having a long page is pushing the needle in either direction. I’m led to believe that the extra info isn’t necessarily hurting or helping conversions – if anything, it’s just more harmful in general to have noise and needless extra bandwidth/load time on a landing page.
Not surprising considering a couple years ago we tested out a trial using Crazy Egg and found that on the scroll-map analysis, the majority of our leads dropped off from scrolling almost immediately below the fold, even on much shorter pages that have more of what we’d consider “important” for our users to see.
It’s interesting to see from our data so far that the results are extremely close, and they will likely remain that way for some time if we keep running a test like this. I just don’t think our leads are really interested in – or even seeing – all of the extra noise, so why keep it? It’s probably a better idea to scrap what we don’t need and then focus on testing what lies above the fold. Ultimately it comes down to a business decision about whether the extra branding content is important enough to keep on the page.
Curious to see if anyone else would interpret my analysis differently. I'm more used to running tests that are much more visible to users, where we'd start seeing some type of trend within a week. I'm inclined to believe that no matter what we have below the fold, it's just not as important as whatever is highest on the page (duh), and maybe these inconclusive results are telling us a little more than just the possibility of a Winner or Loser. But I am still willing to wait and continue running the test – just worried about missing out on more important tests out there!