TL;DR
PROXIMA is a diagnostic framework that assesses the reliability of proxy metrics in online experiments by scoring their effect correlation, directional accuracy, and segment-level fragility, so that A/B tests lead to correct ship/no-ship decisions.
Contribution
It introduces a composite reliability score for proxy metrics that directly audits whether a candidate proxy leads to correct launch decisions, unlike surrogate-index approaches, which predict long-term treatment effects.
Findings
Early engagement metrics achieve 0.80 reliability on Criteo and 0.62 on KuaiRec.
Segment-level heterogeneity is higher in the recommendation domain (68%) than in advertising (13%).
The composite reliability score outperforms correlation alone in discriminating reliable proxies.
Abstract
Online A/B testing at scale relies on proxy metrics -- short-term, easily measured signals used in place of slow-moving long-term outcomes. When the proxy-outcome relationship is heterogeneous across user segments, aggregate correlation can mask directional failures akin to Simpson's Paradox, leading to costly ship/no-ship errors. We introduce PROXIMA (Proxy Metric Validation Framework for Online Experiments), a lightweight diagnostic framework that scores proxy reliability through a composite of three complementary dimensions: normalised effect correlation, directional accuracy, and segment-level fragility rate. Unlike surrogate-index approaches that predict long-term treatment effects, PROXIMA directly audits whether a candidate proxy leads to correct launch decisions and flags the user segments where it fails. We validate PROXIMA on two public datasets -- the Criteo Uplift corpus…
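To make the three dimensions named in the abstract concrete, here is a minimal Python sketch of how a composite score of this kind could be assembled. It is not the paper's implementation. The function names, the rescaling of Pearson's r to [0, 1], the sign-disagreement definition of fragility, and the equal weights are all assumptions chosen for illustration, since the abstract does not state the exact formulas.

```python
import numpy as np

def normalised_correlation(proxy_effects, outcome_effects):
    # Assumption: Pearson r between per-experiment effects, rescaled from [-1, 1] to [0, 1].
    r = np.corrcoef(proxy_effects, outcome_effects)[0, 1]
    return float((r + 1.0) / 2.0)

def directional_accuracy(proxy_effects, outcome_effects):
    # Share of experiments where the proxy effect and the outcome effect have the same sign.
    same_sign = np.sign(proxy_effects) == np.sign(outcome_effects)
    return float(np.mean(same_sign))

def fragility_rate(segment_proxy_effects, segment_outcome_effects):
    # Assumption: a segment is "fragile" when its proxy and outcome effects disagree in sign.
    disagree = np.sign(segment_proxy_effects) != np.sign(segment_outcome_effects)
    return float(np.mean(disagree))

def composite_reliability(proxy_effects, outcome_effects,
                          segment_proxy_effects, segment_outcome_effects,
                          weights=(1 / 3, 1 / 3, 1 / 3)):
    # Hypothetical equal weighting; fragility enters as (1 - rate) so higher is better.
    w_corr, w_dir, w_frag = weights
    return (w_corr * normalised_correlation(proxy_effects, outcome_effects)
            + w_dir * directional_accuracy(proxy_effects, outcome_effects)
            + w_frag * (1.0 - fragility_rate(segment_proxy_effects,
                                             segment_outcome_effects)))

# Toy data (made up): four experiments and four user segments.
proxy = np.array([1.0, 2.0, -1.0, 3.0])
outcome = np.array([2.0, 4.0, -2.0, 6.0])
seg_proxy = np.array([1.0, -1.0, 2.0, 1.0])
seg_outcome = np.array([1.0, 1.0, 2.0, -1.0])

score = composite_reliability(proxy, outcome, seg_proxy, seg_outcome)
print(f"composite reliability: {score:.3f}")
```

In the toy data the aggregate effects are perfectly correlated and always agree in sign, yet half of the segments disagree in direction. That is the situation the abstract warns about: correlation alone would rate this proxy as ideal, while the fragility term pulls the composite down.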
