$\Delta\text{-}{\rm OPE}$: Off-Policy Estimation with Pairs of Policies
Olivier Jeunen, Aleksei Ustimenko

TL;DR
This paper introduces $\Delta\text{-}{\rm OPE}$, a pairwise off-policy estimation method that reduces variance in policy value difference estimation, improving offline evaluation and learning in recommendation systems.
Contribution
The paper proposes $\Delta\text{-}{\rm OPE}$, a novel pairwise off-policy estimation framework that leverages covariance between policies to reduce variance and enhance efficiency.
Findings
$\Delta\text{-}{\rm OPE}$ improves estimation accuracy in simulations and real experiments.
Variance reduction leads to better policy evaluation and learning outcomes.
The method outperforms traditional estimators in offline and online settings.
Abstract
The off-policy paradigm casts recommendation as a counterfactual decision-making task, allowing practitioners to unbiasedly estimate online metrics using offline data. This leads to effective evaluation metrics, as well as learning procedures that directly optimise online success. Nevertheless, the high variance that comes with unbiasedness is typically the crux that complicates practical applications. An important insight is that the difference between policy values can often be estimated with significantly reduced variance, if said policies have positive covariance. This allows us to formulate a pairwise off-policy estimation task: $\Delta\text{-}{\rm OPE}$. $\Delta\text{-}{\rm OPE}$ subsumes the common use-case of estimating improvements of a learnt policy over a production policy, using data collected by a stochastic logging policy. We introduce methods…
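The variance argument in the abstract rests on the identity Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B): when two policy-value estimates are positively correlated, their difference is less noisy than either estimate suggests. The following is a minimal sketch of that effect only, not the estimators proposed in the paper. It assumes a hypothetical context-free bandit with three actions, made-up reward rates and policies, and plain inverse propensity scoring (IPS). It compares the difference of two IPS estimates computed on the same logged sample against the difference of two IPS estimates computed on independent samples.

```python
import random
import statistics


def simulate(n_logs=200, n_runs=500, seed=0):
    """Compare paired vs. unpaired IPS estimates of V(pi_t) - V(pi_p).

    All numbers below are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    actions = [0, 1, 2]
    reward_prob = [0.2, 0.5, 0.8]    # hypothetical Bernoulli reward rate per action
    pi_0 = [1 / 3, 1 / 3, 1 / 3]     # stochastic logging policy
    pi_p = [0.5, 0.3, 0.2]           # production policy
    pi_t = [0.4, 0.3, 0.3]           # learnt (target) policy, close to pi_p

    # Ground-truth difference in policy values.
    true_delta = sum((pt - pp) * r for pt, pp, r in zip(pi_t, pi_p, reward_prob))

    def draw_log():
        # Sample one (action, reward) pair under the logging policy.
        a = rng.choices(actions, weights=pi_0)[0]
        r = 1.0 if rng.random() < reward_prob[a] else 0.0
        return a, r

    def ips(logs, pi):
        # Inverse propensity scoring estimate of the value of policy pi.
        return sum(pi[a] / pi_0[a] * r for a, r in logs) / len(logs)

    paired, unpaired = [], []
    for _ in range(n_runs):
        logs_a = [draw_log() for _ in range(n_logs)]
        logs_b = [draw_log() for _ in range(n_logs)]
        # Same sample for both policies: the estimates covary positively.
        paired.append(ips(logs_a, pi_t) - ips(logs_a, pi_p))
        # Independent samples: covariance is zero, so the variances add up.
        unpaired.append(ips(logs_a, pi_t) - ips(logs_b, pi_p))
    return true_delta, paired, unpaired


if __name__ == "__main__":
    true_delta, paired, unpaired = simulate()
    print("true difference:  ", round(true_delta, 4))
    print("paired variance:  ", statistics.variance(paired))
    print("unpaired variance:", statistics.variance(unpaired))
```

Both estimators are unbiased for the same difference in this setup; only their spread differs. The gap narrows as the two policies become less similar, since the covariance term shrinks.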
Taxonomy
Topics: Water resources management and optimization · Auction Theory and Applications · Economic Policies and Impacts
