Setting a Primary Goal for an Experiment and Statistical Power
My colleague attended Opticon last week and came back with some interesting updates and learnings. One discovery she brought back to our team was that setting a primary goal for an experiment actually puts more statistical power behind that goal, and that having too many secondary goals can result in extended testing times and a lower chance for those goals to reach significance.
Is there a recommended number of goals for tests, and how should I calculate it? Is this number based on traffic or other metrics? How can I know that the goals I've created will be useful, rather than slowing down my test by stretching each goal's chance of reaching significance too thin?
Thanks in advance!
Conversion Optimization Consultant
Thank you for your question. The notion of the number of goals being tied to the overall power of the experiment stems from Optimizely’s decision to control False Discovery Rates instead of False Positive Rates when declaring winners and losers.
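For intuition on what controlling the false discovery rate means, here is the textbook Benjamini-Hochberg step-up procedure. Note this is only an illustrative sketch: Optimizely's Stats Engine uses its own sequential method, not this fixed-horizon version.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Illustrative Benjamini-Hochberg step-up procedure.

    Returns the indices of hypotheses (think: goal/variation
    combinations) rejected while controlling the false discovery
    rate at level q. This is the classic fixed-horizon procedure,
    not Optimizely's sequential variant.
    """
    m = len(p_values)
    # Rank the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis at or below that rank.
    return sorted(order[:k_max])

# Three strong signals and one weak one, at an FDR level of 10%:
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.50], q=0.10))  # -> [0, 1, 2]
```

The key property is that the threshold each p-value must beat depends on how many hypotheses you test, which is why adding goals changes how long results take to call.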
Since testing more goals and variations increases the chance of finding a spurious result (consider flipping a coin once versus ten times: which has the higher chance of showing at least one heads?), it takes longer to be as certain of your winners and losers when you have more goals and variations.
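The coin-flip intuition can be made concrete. If each goal were independently tested at a false positive rate of alpha, the chance of at least one spurious "winner" would grow quickly with the number of goals (a simplified back-of-the-envelope model; real goals are rarely independent):

```python
def p_at_least_one_spurious(alpha, n_goals):
    """Chance of at least one false positive across n_goals
    independent tests, each run at false positive rate alpha.
    A simplified model: real goals are usually correlated."""
    return 1 - (1 - alpha) ** n_goals

# One goal at alpha = 0.10: about a 10% chance of a spurious result.
print(round(p_at_least_one_spurious(0.10, 1), 2))   # -> 0.1
# Ten goals at the same alpha: about a 65% chance.
print(round(p_at_least_one_spurious(0.10, 10), 2))  # -> 0.65
```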
The more subtle point is that it takes longer when you add more low signal goals, not high signal ones. The analogy I use is finding needles in a haystack. Adding more needles (high signal goals) to the haystack doesn’t make it harder to find needles, but adding more hay (low signal goals) does.
We are aware that our customers care about fast as well as accurate results, particularly for certain metrics. For this reason we made your primary goal exempt from this multiple testing correction. Your primary goal will reach significance at the same speed regardless of how many other goals you have in your experiment. The intended use of this feature is to have a single goal which will carry more weight than the others in any decisions you make from your experiment results.
As for your question on how many secondary goals to have, there is no universal formula. The reason is that the answer to this question depends on the ROI of secondary goals to your experiment, your traffic, how long you have to run your experiment, and how impactful you believe your variations will be on your secondary goals.
One rule of thumb is the following. If you have a few secondary goals which are very important to your experiment, and your traffic is such that it will take too long to reach significance at a cutoff that is X times more conservative, where X is the number of these important secondary goals, then you’ll want to start minimizing how many other secondary goals you have. For example, say I had 3 secondary goals I really cared about, and wanted to test at a 90% significance cutoff. I could pull up Optimizely’s sample size calculator, enter the smallest effect size I thought was reasonable for these goals, and for the significance cutoff enter 100 - (100 - 90) / 3 ≈ 96.7. If the sample size that Optimizely returns is unreasonable given my site traffic, I should try to add as few goals as possible outside these 3 important secondary goals.
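That arithmetic can be wrapped in a tiny helper (a sketch of the Bonferroni-style split described above; the function name and signature are mine, not part of any Optimizely API):

```python
def corrected_cutoff(base_cutoff, n_important):
    """Split the error budget (100 - base_cutoff) evenly across
    n_important goals, Bonferroni-style, and return the more
    conservative significance cutoff to enter into a sample
    size calculator. Cutoffs are percentages, e.g. 90 for 90%."""
    error_budget = 100 - base_cutoff
    return 100 - error_budget / n_important

# 3 important secondary goals at a 90% base cutoff:
print(round(corrected_cutoff(90, 3), 1))  # -> 96.7
```

If the sample size the calculator returns at this stricter cutoff is out of reach for your traffic, that is the signal to trim your list of secondary goals.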
You may also want to consider running different ‘types’ of experiments: ones designed to prove a pre-specified hypothesis (e.g. convince yourself or others that a particular variation does indeed move the needle on a goal), which I’ll call ‘directed experiments’, and ones which are more exploratory in nature (e.g. how does changing my site in ways X, Y, and Z affect user behavior?).

Directed experiments warrant being as selective as possible with both the number of variations and the number of secondary goals, keeping only the ones which will be useful to prove your thesis. They aim to find significant results as quickly as possible and would use the rule of thumb I mentioned earlier. Exploratory experiments encourage more creativity and less pruning; they will likely leverage difference intervals (which are valid even if significance has not yet been reached) and have the freedom to run longer. While not always the case, it is common for interesting results from exploratory efforts to come back as a directed experiment, perhaps in a more developed form.

Finally, if you are comfortable assuming that an exploratory experiment will have little interaction with your directed experiments (perhaps they test different parts of the site), then it is reasonable to run them concurrently.
Of course this is only one approach. Individual results will vary.
As for your last question on how to know which goals will be useful: this sort of knowledge often comes from experience running A/B tests in a particular industry, and is likely unique to each organization. Looking over past test results can give insight here.
I hope this information is helpful to organize your thoughts around testing strategy. I realize it’s not a formal calculation. If you do feel that more objective guidance would be useful, I encourage you to submit a product idea and we will see what we can do!
Statistician at Optimizely
Conversion Optimization Consultant