It's About Time: What A/B Test Metrics Estimate
Sebastian Ankargren, Mattias Frånberg, Mårten Schultzberg

TL;DR
This paper analyzes the estimands of cumulative versus windowed metrics in online A/B tests, revealing trade-offs in statistical power, timing, and interpretability based on experiment context.
Contribution
It provides a detailed comparison of cumulative and windowed metrics, highlighting their respective advantages, limitations, and implications for experimental design in digital environments.
Findings
Cumulative metrics can lose statistical power as more observations accrue, because their estimands depend on the experiment's intake and runtime.
Cumulative metrics deliver results earlier and give a quick signal when an effect is present.
Neither metric type is universally better; choice depends on experiment specifics.
Abstract
Online controlled experiments, or A/B tests, are large-scale randomized trials in digital environments. This paper investigates the estimands of the difference-in-means estimator in these experiments, focusing on scenarios with repeated measurements on users. We compare cumulative metrics that use all post-exposure data for each user to windowed metrics that measure each user over a fixed time window. We analyze the estimands and highlight trade-offs between the two types of metrics. Our findings reveal that while cumulative metrics eliminate the need for pre-defined measurement windows, they result in estimands that are more intricately tied to the experiment intake and runtime. This complexity can lead to counter-intuitive practical consequences, such as decreased statistical power with more observations. However, cumulative metrics offer earlier results and can quickly detect strong…
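The distinction between the two metric types can be made concrete with a minimal sketch. Everything below is an illustrative assumption rather than the paper's implementation: the function names, the toy activity data, and the choice to exclude users whose window is not yet complete at analysis time. The sketch shows that a user's cumulative value depends on when they entered the experiment and how long it has run, while the windowed value does not.

```python
from statistics import mean

def cumulative_metric(daily, exposure_day, analysis_day):
    """Sum all post-exposure activity observed up to the analysis day."""
    return sum(v for day, v in daily.items()
               if exposure_day <= day <= analysis_day)

def windowed_metric(daily, exposure_day, analysis_day, window):
    """Sum activity over a fixed window after exposure.

    Returns None when the window is not fully observed yet
    (an assumption of this sketch, not a rule from the paper).
    """
    last_day = exposure_day + window - 1
    if last_day > analysis_day:
        return None
    return sum(v for day, v in daily.items()
               if exposure_day <= day <= last_day)

def diff_in_means(treatment, control):
    """Difference-in-means estimator over per-user metric values."""
    return mean(treatment) - mean(control)

# Hypothetical toy data: (group, exposure_day, {day: activity}).
# Treated users produce 2 units per day, control users 1 unit per day.
users = [
    ("treatment", 0, {d: 2 for d in range(0, 10)}),
    ("treatment", 5, {d: 2 for d in range(5, 10)}),
    ("control", 0, {d: 1 for d in range(0, 10)}),
    ("control", 5, {d: 1 for d in range(5, 10)}),
    ("control", 8, {d: 1 for d in range(8, 10)}),
]
ANALYSIS_DAY, WINDOW = 9, 3

def values(group, metric):
    """Per-user metric values for one group, dropping unobserved users."""
    out = [metric(daily, start) for g, start, daily in users if g == group]
    return [v for v in out if v is not None]

def cum(daily, start):
    return cumulative_metric(daily, start, ANALYSIS_DAY)

def win(daily, start):
    return windowed_metric(daily, start, ANALYSIS_DAY, WINDOW)

cumulative_diff = diff_in_means(values("treatment", cum), values("control", cum))
windowed_diff = diff_in_means(values("treatment", win), values("control", win))
print(cumulative_diff, windowed_diff)
```

In this toy data, the two treated users have different cumulative values only because they entered on different days, whereas their windowed values are identical. Rerunning with a later `ANALYSIS_DAY` changes the cumulative estimate but leaves the windowed value of every user with a complete window unchanged.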
Taxonomy
Topics: Advanced Causal Inference Techniques · Mobile Crowdsensing and Crowdsourcing · Survey Methodology and Nonresponse
