Optimal and Adaptive Off-policy Evaluation in Contextual Bandits
Yu-Xiang Wang, Alekh Agarwal, Miroslav Dudik

TL;DR
This paper analyzes off-policy evaluation in contextual bandits when no consistent reward model is available, establishes a minimax lower bound on the achievable mean squared error, and proposes an estimator that uses an existing (possibly inconsistent) reward model to improve accuracy.
Contribution
It introduces the SWITCH estimator, which uses an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR, and shows its benefits on diverse datasets.
Findings
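A minimax lower bound on MSE is established for the agnostic setting; IPS and DR match it up to constants
The SWITCH estimator achieves a better bias-variance tradeoff than IPS and DR, with a proven upper bound on its MSE
Empirical results on a diverse collection of datasets demonstrate the benefits of SWITCH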
Abstract
We study the off-policy evaluation problem---estimating the value of a target policy using data collected by another policy---under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of…
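The abstract names three estimators without giving formulas. The following is a minimal sketch in Python, assuming the standard forms of IPS and DR and a thresholded form of SWITCH in which logged actions with importance weight at most tau are handled by IPS and actions with larger weight fall back on the reward model. The threshold name tau, the (n, K) array layout, and all function names are illustrative assumptions, not taken from the text above.

```python
import numpy as np


def ips_estimate(actions, rewards, pi, mu):
    """Inverse propensity scoring.

    actions: (n,) logged action indices
    rewards: (n,) observed rewards
    pi, mu:  (n, K) target / logging action probabilities (mu must be > 0)
    """
    idx = np.arange(len(actions))
    w = pi[idx, actions] / mu[idx, actions]  # importance weights of logged actions
    return float(np.mean(w * rewards))


def dr_estimate(actions, rewards, pi, mu, r_hat):
    """Doubly robust: model-based value plus an importance-weighted residual.

    r_hat: (n, K) reward-model predictions (assumed given, possibly biased)
    """
    idx = np.arange(len(actions))
    w = pi[idx, actions] / mu[idx, actions]
    direct = np.sum(pi * r_hat, axis=1)  # model-based value under pi
    correction = w * (rewards - r_hat[idx, actions])
    return float(np.mean(direct + correction))


def switch_estimate(actions, rewards, pi, mu, r_hat, tau):
    """Thresholded switch between IPS and the reward model (assumed form).

    Actions whose importance weight is <= tau use IPS; actions whose
    weight exceeds tau use the reward model instead.
    """
    idx = np.arange(len(actions))
    w_all = pi / mu  # weights for every (context, action) pair
    w_logged = w_all[idx, actions]
    ips_part = w_logged * rewards * (w_logged <= tau)
    model_part = np.sum(r_hat * pi * (w_all > tau), axis=1)
    return float(np.mean(ips_part + model_part))
```

Under these assumptions, setting tau to infinity reduces the sketch to IPS, and setting tau to 0 reduces it to the purely model-based estimate; intermediate values of tau trade the variance of large importance weights against the bias of the reward model.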
Taxonomy
Topics: Advanced Bandit Algorithms Research · Advanced Causal Inference Techniques · Machine Learning and Algorithms
