Optimal Baseline Corrections for Off-Policy Contextual Bandits
Shashank Gupta, Olivier Jeunen, Harrie Oosterhuis, and Maarten de Rijke

TL;DR
This paper introduces a unified framework for baseline corrections in off-policy contextual bandits, deriving an optimal estimator that reduces variance and improves offline policy evaluation and learning.
Contribution
It unifies existing control variate methods under a single framework and derives a closed-form variance-optimal estimator for off-policy bandit problems.
Findings
The optimal estimator significantly reduces variance in policy evaluation.
Empirical results confirm improved performance over existing methods.
The framework minimizes data requirements for effective learning.
Abstract
The off-policy learning paradigm allows for recommender systems and general ranking applications to be framed as decision-making problems, where we aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric. With unbiasedness comes potentially high variance, and prevalent methods exist to reduce estimation variance. These methods typically make use of control variates, either additive (i.e., baseline corrections or doubly robust methods) or multiplicative (i.e., self-normalisation). Our work unifies these approaches by proposing a single framework built on their equivalence in learning scenarios. The foundation of our framework is the derivation of an equivalent baseline correction for all of the existing control variates. Consequently, our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form…
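The additive control variate described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes the common baseline-corrected inverse propensity scoring (IPS) form, mean of w * (r - beta) + beta with w = pi(a|x) / pi_0(a|x), and it picks beta with the textbook control-variate coefficient Cov(w * r, w) / Var(w). The paper's own closed-form solution is not shown in this excerpt. The two-action context-free bandit, its probabilities, and all function names are hypothetical choices made for illustration.

```python
import random

# Hypothetical toy problem: two actions, no context.
LOGGING_POLICY = [0.8, 0.2]   # pi_0(a), the policy that collected the data
TARGET_POLICY = [0.3, 0.7]    # pi(a), the policy we want to evaluate
REWARD_PROB = [0.4, 0.6]      # P(r = 1 | a); true target value is 0.54


def simulate(n, rng):
    """Log n interactions under the logging policy.

    Returns importance weights pi(a) / pi_0(a) and binary rewards.
    """
    weights, rewards = [], []
    for _ in range(n):
        a = 0 if rng.random() < LOGGING_POLICY[0] else 1
        weights.append(TARGET_POLICY[a] / LOGGING_POLICY[a])
        rewards.append(1.0 if rng.random() < REWARD_PROB[a] else 0.0)
    return weights, rewards


def ips_estimate(weights, rewards, baseline=0.0):
    """Baseline-corrected IPS: mean of w * (r - baseline) + baseline.

    With baseline=0.0 this is plain IPS. For any fixed baseline the
    estimate stays unbiased because the weights have expectation 1.
    """
    n = len(weights)
    total = sum(w * (r - baseline) + baseline for w, r in zip(weights, rewards))
    return total / n


def control_variate_baseline(weights, rewards):
    """Empirical Cov(w * r, w) / Var(w), the textbook control-variate choice.

    Estimating the baseline from the same sample adds a small bias that
    shrinks as the sample grows (an assumption of this sketch).
    """
    n = len(weights)
    wr = [w * r for w, r in zip(weights, rewards)]
    mean_w = sum(weights) / n
    mean_wr = sum(wr) / n
    cov = sum((x - mean_wr) * (w - mean_w) for x, w in zip(wr, weights)) / n
    var = sum((w - mean_w) ** 2 for w in weights) / n
    return cov / var if var > 0 else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    w, r = simulate(1000, rng)
    beta = control_variate_baseline(w, r)
    print("plain IPS estimate:     ", ips_estimate(w, r))
    print("baseline-corrected IPS: ", ips_estimate(w, r, beta))
```

Repeating the simulation many times and comparing the spread of the two estimates is one way to observe the variance reduction that the abstract refers to. In this toy setup both estimates should center near the true value of 0.54.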
Taxonomy
Topics: Advanced Bandit Algorithms Research · Healthcare Operations and Scheduling Optimization · Smart Grid Energy Management
