Optimal Baseline Corrections for Off-Policy Contextual Bandits

Shashank Gupta, Olivier Jeunen, Harrie Oosterhuis, and Maarten de Rijke

arXiv:2405.05736·cs.LG·August 15, 2024

TL;DR

This paper introduces a unified framework for baseline corrections in off-policy contextual bandits, deriving an optimal estimator that reduces variance and improves offline policy evaluation and learning.

Contribution

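The paper unifies existing control variate methods, both additive (baseline corrections, doubly robust methods) and multiplicative (self-normalisation), under a single framework, and derives a closed-form variance-optimal unbiased estimator for off-policy bandit problems.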

Findings

01 The optimal estimator significantly reduces variance in policy evaluation.

02 Empirical results confirm improved performance over existing methods.

03 The framework minimizes data requirements for effective learning.

Abstract

The off-policy learning paradigm allows for recommender systems and general ranking applications to be framed as decision-making problems, where we aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric. With unbiasedness comes potentially high variance, and prevalent methods exist to reduce estimation variance. These methods typically make use of control variates, either additive (i.e., baseline corrections or doubly robust methods) or multiplicative (i.e., self-normalisation). Our work unifies these approaches by proposing a single framework built on their equivalence in learning scenarios. The foundation of our framework is the derivation of an equivalent baseline correction for all of the existing control variates. Consequently, our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form…
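The additive and multiplicative control variates named in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: the three-action, context-free bandit, the policies `pi0` and `pi`, the reward probabilities, and all function names are assumptions made for illustration. The `control_variate_beta` helper uses the textbook control-variate coefficient Cov(w*r, w) / Var(w), which may differ from the closed-form solution derived in the paper; estimating it from the same sample also gives up exact unbiasedness.

```python
import random


def simulate(n, seed=0):
    """Log n interactions from a toy context-free bandit (illustrative assumption)."""
    rng = random.Random(seed)
    pi0 = [0.5, 0.3, 0.2]          # hypothetical logging policy
    pi = [0.2, 0.3, 0.5]           # hypothetical target policy
    reward_prob = [0.2, 0.5, 0.8]  # hypothetical Bernoulli reward rates
    w, r = [], []
    for _ in range(n):
        a = rng.choices(range(3), weights=pi0)[0]
        w.append(pi[a] / pi0[a])   # importance weight pi(a) / pi0(a)
        r.append(1.0 if rng.random() < reward_prob[a] else 0.0)
    return w, r


def ips(w, r):
    """Plain inverse propensity scoring: mean of w_i * r_i."""
    return sum(wi * ri for wi, ri in zip(w, r)) / len(w)


def baseline_corrected(w, r, beta):
    """Additive control variate: beta + mean of w_i * (r_i - beta)."""
    return beta + sum(wi * (ri - beta) for wi, ri in zip(w, r)) / len(w)


def snips(w, r):
    """Multiplicative control variate: self-normalised IPS."""
    return sum(wi * ri for wi, ri in zip(w, r)) / sum(w)


def control_variate_beta(w, r):
    """Sample estimate of Cov(w*r, w) / Var(w), the textbook coefficient."""
    n = len(w)
    wr = [wi * ri for wi, ri in zip(w, r)]
    mean_w = sum(w) / n
    mean_wr = sum(wr) / n
    cov = sum((x - mean_wr) * (wi - mean_w) for x, wi in zip(wr, w)) / n
    var = sum((wi - mean_w) ** 2 for wi in w) / n
    return cov / var


if __name__ == "__main__":
    w, r = simulate(1000)
    beta = control_variate_beta(w, r)
    print("IPS:               ", ips(w, r))
    print("SNIPS:             ", snips(w, r))
    print("Baseline-corrected:", baseline_corrected(w, r, beta))
```

With `beta = 0` the baseline-corrected estimate reduces to plain IPS, and plugging the SNIPS value in as `beta` returns the SNIPS value itself. This is one simple way to see how a multiplicative control variate can be rewritten as an additive one.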

Code & Models

Repositories

shashankg7/recsys2024_optimal_baseline (Official)

Taxonomy

Topics: Advanced Bandit Algorithms Research · Healthcare Operations and Scheduling Optimization · Smart Grid Energy Management