Powerful A/B-Testing Metrics and Where to Find Them
Olivier Jeunen, Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko

TL;DR
This paper introduces a method to evaluate and identify effective A/B testing metrics by analyzing their statistical power and error rates using data from large-scale online experiments on video platforms.
Contribution
The paper proposes a novel pipeline to quantify the utility of supporting metrics in A/B tests by leveraging historical experiment data to assess their statistical power and error rates.
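As a rough illustration of what assessing a metric's power and error rates on historical experiments can look like, the sketch below computes empirical rates from per-experiment z-scores. It is not the authors' pipeline: the function names, the split into "known effect" and A/A experiments, the two-sided z-test, and the example z-scores are all assumptions made for illustration.

```python
from math import erf, sqrt


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a z-score under a standard normal null."""
    cdf = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
    return 2.0 * (1.0 - cdf)


def error_rates(z_known_effect, z_aa, alpha=0.05):
    """Empirical error rates for one metric over historical experiments.

    z_known_effect: z-scores from experiments assumed to have a real effect.
    z_aa: z-scores from A/A experiments, where no effect exists.
    Note: the direction of the effect is ignored in this simplified sketch.
    """
    detected = sum(two_sided_p_value(z) < alpha for z in z_known_effect)
    false_positives = sum(two_sided_p_value(z) < alpha for z in z_aa)
    power = detected / len(z_known_effect)
    return {
        "power": power,                           # share of real effects detected
        "type_II": 1.0 - power,                   # share of real effects missed
        "type_I": false_positives / len(z_aa),    # share of A/A tests flagged
    }


# Hypothetical z-scores for a single supporting metric.
z_known = [2.5, 1.2, 3.1, 0.4]
z_aa = [0.3, -1.1, 2.2, 0.8]
rates = error_rates(z_known, z_aa)
print(rates)
```

Under these assumptions, a supporting metric is more useful when it shows high empirical power on experiments with a real effect while its type-I rate on A/A experiments stays close to the chosen `alpha`.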
Findings
Identified metrics with high statistical power across platforms.
Provided insights into the utility of various supporting metrics.
Demonstrated the pipeline's effectiveness on large-scale experiment data.
Abstract
Online controlled experiments, colloquially known as A/B-tests, are the bread and butter of real-world recommender system evaluation. Typically, end-users are randomly assigned some system variant, and a plethora of metrics are then tracked, collected, and aggregated throughout the experiment. A North Star metric (e.g. long-term growth or revenue) is used to assess which system variant should be deemed superior. As a result, most collected metrics are supporting in nature, and serve to either (i) provide an understanding of how the experiment impacts user experience, or (ii) allow for confident decision-making when the North Star metric moves insignificantly (i.e. a false negative or type-II error). The latter is not straightforward: suppose a treatment variant leads to fewer but longer sessions, with more views but fewer engagements; should this be considered a positive or negative…
Taxonomy
Topics: Fault Detection and Control Systems · Software Testing and Debugging Techniques
