
TL;DR
This paper advocates for using Bonferroni correction in online experiments, highlighting its simplicity, interpretability, and empirical power efficiency compared to more complex methods.
Contribution
It makes a four-part argument, backed by a simulation study and an analysis of 1,296 experiments run on Spotify's Confidence platform, that Bonferroni correction is a practical and effective method for controlling false positives in online experimentation.
Findings
Bonferroni is the simplest broadly implementable method for controlling the family-wise error rate (FWER) that yields unconditional simultaneous confidence intervals for every metric.
Power loss depends on how the correction family is specified and on the number of non-null metrics.
Restricting the family to success metrics improves deployment rates by 4-5 percentage points.
Abstract
We argue that Bonferroni correction is a better choice for online experimentation than it is commonly given credit for. The case rests on four considerations. First, it is the simplest broadly implementable FWER-controlling method that produces unconditional simultaneous confidence intervals for every metric. Second, in a well-specified decision framework, guardrail and quality metrics use intersection-union logic and cannot inflate the false positive rate, so the Bonferroni denominator is the number of success metrics only, not the total metric count. Third, it is uniquely tractable for pre-experiment sample size calculations. Fourth, we contextualise the power cost empirically. Drawing on a simulation study and an empirical analysis of 1,296 experiments run on Spotify's experimentation platform, Confidence, we show that the power loss relative to more sophisticated FWER methods…
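The second and third considerations can be made concrete with a minimal sketch. It is not taken from the paper or from Confidence: the function names, the two-sided z-test, the equal-variance two-group design, and the numbers in the usage lines are all illustrative assumptions. It shows the Bonferroni denominator set to the number of success metrics only, and how that adjusted level feeds directly into a simultaneous confidence interval and a pre-experiment sample size formula.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for z quantiles


def bonferroni_alpha(alpha: float, n_success_metrics: int) -> float:
    """Per-metric significance level. The denominator counts success
    metrics only (assumption: guardrail metrics are handled separately)."""
    if n_success_metrics < 1:
        raise ValueError("need at least one success metric")
    return alpha / n_success_metrics


def simultaneous_ci(estimate: float, std_err: float,
                    alpha: float, n_success_metrics: int) -> tuple[float, float]:
    """Two-sided Bonferroni-adjusted interval for one metric (z approximation)."""
    a = bonferroni_alpha(alpha, n_success_metrics)
    z = STD_NORMAL.inv_cdf(1 - a / 2)
    return estimate - z * std_err, estimate + z * std_err


def sample_size_per_group(mde: float, sigma: float, alpha: float,
                          power: float, n_success_metrics: int) -> int:
    """Approximate per-group n for a two-sample, two-sided z-test with
    equal variances, using the standard normal-approximation formula."""
    a = bonferroni_alpha(alpha, n_success_metrics)
    z_alpha = STD_NORMAL.inv_cdf(1 - a / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * (sigma * (z_alpha + z_power) / mde) ** 2)


# Hypothetical inputs: 3 success metrics, overall alpha of 0.05.
n_single = sample_size_per_group(mde=0.1, sigma=1.0, alpha=0.05,
                                 power=0.8, n_success_metrics=1)
n_three = sample_size_per_group(mde=0.1, sigma=1.0, alpha=0.05,
                                power=0.8, n_success_metrics=3)
ci_three = simultaneous_ci(estimate=0.02, std_err=0.01,
                           alpha=0.05, n_success_metrics=3)
```

Because the adjustment is a single division, the required sample size stays a closed-form expression in the number of success metrics, which is what makes planning before the experiment straightforward under these assumptions.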
