Risk-aware product decisions in A/B tests with multiple metrics
Mårten Schultzberg, Sebastian Ankargren, Mattias Frånberg

TL;DR
This paper develops a theoretical framework for risk-aware decision rules in A/B testing with multiple metrics, ensuring reliable product decisions at Spotify by addressing multiplicity and error control.
Contribution
It introduces a decision rule that combines tests on success metrics with non-inferiority, deterioration, and quality tests, and provides a matching design and analysis plan to control the risk of incorrect decisions in multi-metric A/B tests.
Findings
The significance level for guardrail metrics evaluated with non-inferiority tests does not need to be multiplicity-adjusted.
Type II error rates must be corrected when including non-inferiority, deterioration, or quality tests.
Monte Carlo simulations demonstrate the effectiveness of the proposed decision framework.
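The first two findings can be illustrated with a small simulation. This is a minimal sketch, not the paper's own simulation study. It assumes independent guardrail metrics, unit-variance z-statistics, a "ship only if every guardrail passes its non-inferiority test" rule, and a Bonferroni-style split of the type II error rate (beta / G). None of these details are given in this summary, and the names `ship_rate` and `powered_mean` are hypothetical.

```python
import random
from statistics import NormalDist


def ship_rate(means, alpha, n_sims=20000, seed=1):
    """Share of simulated experiments in which every guardrail passes a
    one-sided non-inferiority z-test at the unadjusted level alpha.

    means: assumed mean of each guardrail's z-statistic (0 means the true
    effect sits exactly on the non-inferiority margin).
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha)
    ships = 0
    for _ in range(n_sims):
        # Ship only if all guardrail tests reject "inferior".
        if all(rng.gauss(mu, 1.0) > crit for mu in means):
            ships += 1
    return ships / n_sims


def powered_mean(alpha, beta):
    """Mean of the z-statistic for a test designed to have power 1 - beta."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha) + nd.inv_cdf(1 - beta)


alpha, beta, G = 0.05, 0.2, 5  # illustrative values, not from the paper

# Worst case for a wrong "ship": one guardrail is exactly on the margin,
# the other four are far on the safe side.
false_ship = ship_rate([0.0] + [10.0] * (G - 1), alpha)

# All guardrails truly non-inferior, each test powered at 1 - beta.
power_unadjusted = ship_rate([powered_mean(alpha, beta)] * G, alpha)

# Same setting, but each test powered at 1 - beta / G (assumed correction).
power_adjusted = ship_rate([powered_mean(alpha, beta / G)] * G, alpha)

print(f"false ship rate, unadjusted alpha: {false_ship:.3f}")
print(f"power, beta per test:              {power_unadjusted:.3f}")
print(f"power, beta / G per test:          {power_adjusted:.3f}")
```

Under these assumptions the false ship rate stays near alpha without any adjustment, because a wrong decision requires the one inferior guardrail to pass its own test. Power behaves differently: with independent metrics the chance that all five tests pass is about 0.8^5 ≈ 0.33 when each is powered at 80%, and about 0.96^5 ≈ 0.82 once each test is designed with beta / G.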
Abstract
In the past decade, AB tests have become the standard method for making product decisions in tech companies. They offer a scientific approach to product development, using statistical hypothesis testing to control the risks of incorrect decisions. Typically, multiple metrics are used in AB tests to serve different purposes, such as establishing evidence of success, guarding against regressions, or verifying test validity. To mitigate risks in AB tests with multiple outcomes, it's crucial to adapt the design and analysis to the varied roles of these outcomes. This paper introduces the theoretical framework for decision rules guiding the evaluation of experiments at Spotify. First, we show that if guardrail metrics with non-inferiority tests are used, the significance level does not need to be multiplicity-adjusted for those tests. Second, if the decision rule includes non-inferiority…
Taxonomy
Topics: Statistical Methods in Clinical Trials
