Evaluating Decision Rules Across Many Weak Experiments
Winston Chou, Colin Gray, Nathan Kallus, Aurélien Bibaut, Simon Ejdemyr

TL;DR
This paper introduces a method for evaluating decision rules in large-scale digital experimentation programs by their cumulative returns to business metrics, even when individual experiments are noisy, and demonstrates it in a Netflix case study.
Contribution
It develops a cross-validation estimator for decision rule evaluation that reduces bias in noisy, weak-signal experiments, enabling better decision-making in large experimentation programs.
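To make the bias problem concrete, the following is a minimal simulation sketch, not the paper's estimator. It assumes a hypothetical decision rule (`launch_if_significant`), normally distributed true effects, and two independent data folds per experiment; all names and parameter values are illustrative. The naive estimate scores launched arms with the same data that selected them, while the cross-validated estimate selects on one fold and scores on the other. Note that this simple split evaluates the rule as applied to half-sized data, which is an assumption of the sketch.

```python
import numpy as np


def launch_if_significant(estimate, std_error, z_threshold=1.0):
    """Hypothetical decision rule: launch treatment when its z-statistic clears a threshold."""
    return estimate / std_error > z_threshold


def evaluate_rules(n_experiments=5000, effect_sd=1.0, fold_se=3.0, seed=0):
    """Compare naive and two-fold cross-validated estimates of cumulative returns.

    All distributions and parameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Weak signals: true effects are small relative to the sampling noise.
    true_effect = rng.normal(0.0, effect_sd, n_experiments)
    fold_a = true_effect + rng.normal(0.0, fold_se, n_experiments)
    fold_b = true_effect + rng.normal(0.0, fold_se, n_experiments)

    # Full-sample estimate pools both folds, so its standard error shrinks.
    full = (fold_a + fold_b) / 2.0
    full_se = fold_se / np.sqrt(2.0)

    # Naive: select and score on the same data (subject to the winner's curse).
    launched_full = launch_if_significant(full, full_se)
    naive = full[launched_full].sum()
    truth_full = true_effect[launched_full].sum()

    # Cross-validated: select on one fold, score on the other, then average.
    launched_a = launch_if_significant(fold_a, fold_se)
    launched_b = launch_if_significant(fold_b, fold_se)
    cross_validated = 0.5 * (fold_b[launched_a].sum() + fold_a[launched_b].sum())
    truth_half = 0.5 * (true_effect[launched_a].sum() + true_effect[launched_b].sum())

    return {
        "naive": naive,
        "truth_full": truth_full,
        "cross_validated": cross_validated,
        "truth_half": truth_half,
    }


results = evaluate_rules()
print("naive bias:", results["naive"] - results["truth_full"])
print("cross-validated error:", results["cross_validated"] - results["truth_half"])
```

In this setup the naive estimate overstates the cumulative return because selection favors experiments whose noise happened to be positive, whereas the held-out fold's noise is independent of the launch decision.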
Findings
The proposed cross-validation estimator is less biased than naive estimators.
Applying the method to Netflix data suggested a 33% increase in cumulative metrics.
The new decision rule was adopted, improving business outcomes.
Abstract
Technology firms conduct randomized controlled experiments ("A/B tests") to learn which actions to take to improve business outcomes. In firms with mature experimentation platforms, experimentation programs can consist of many thousands of tests. To effectively scale experimentation, firms rely on decision rules: standard operating procedures for mapping the results of an experiment to a choice of treatment arm to launch to the general user population. Despite the critical role of decision rules in translating experimentation into business decisions, rigorous guidance on how to evaluate and choose decision rules is scarce. This paper proposes to evaluate decision rules based on their cumulative returns to business north star metrics. Although intuitive and easy to explain to decision-makers, this quantity can be difficult to estimate, especially when experiments have weak…
