Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning
Otmane Sakhi, Imad Aouali, Pierre Alquier, Nicolas Chopin

TL;DR
This paper introduces a logarithmically smoothed importance weighting estimator (LS) for offline contextual bandits. Backed by fully empirical, pessimistic concentration bounds, it gives tighter bounds and improves policy selection and learning.
Contribution
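It derives fully empirical concentration bounds for a broad class of importance weighting risk estimators and, by seeking the tightest bound in that class, obtains a new logarithmically smoothed estimator (LS) for pessimistic off-policy evaluation, selection, and learning.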
Findings
The LS estimator yields tighter bounds than existing methods.
Empirical results show improved policy selection and learning.
The bounds are general enough to cover most existing importance weighting estimators.
Abstract
This work investigates the offline formulation of the contextual bandit problem, where the goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies. Motivated by critical applications, we move beyond point estimators. Instead, we adopt the principle of pessimism where we construct upper bounds that assess a policy's worst-case performance, enabling us to confidently select and learn improved policies. Precisely, we introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators. These bounds are general enough to cover most existing estimators and pave the way for the development of new ones. In particular, our pursuit of the tightest bound within this class motivates a novel estimator (LS), that logarithmically smooths large importance weights. The…
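To make the idea of logarithmically smoothing large importance weights concrete, here is a minimal sketch in Python. The truncated abstract above does not state the estimator's formula, so the form used here is an assumption: costs c_i lie in [-1, 0], w_i is the importance weight of the target policy against the behavior policy, and the smoothed risk estimate is -(1/(lambda n)) * sum_i log(1 - lambda * w_i * c_i), which reduces to plain inverse propensity scoring (IPS) as lambda goes to 0. The function names, the parameter `lam`, and the sample data are hypothetical and chosen for illustration only.

```python
import numpy as np


def ips_estimate(weights, costs):
    """Plain importance-weighted (IPS) risk estimate: mean of w_i * c_i."""
    weights = np.asarray(weights, dtype=float)
    costs = np.asarray(costs, dtype=float)
    return float(np.mean(weights * costs))


def ls_estimate(weights, costs, lam):
    """Logarithmically smoothed risk estimate (assumed form, see lead-in).

    Assumes costs in [-1, 0], weights >= 0 and lam >= 0, so the argument
    of log1p is always non-negative.
    """
    weights = np.asarray(weights, dtype=float)
    costs = np.asarray(costs, dtype=float)
    if lam == 0:
        # No smoothing: fall back to IPS, the lam -> 0 limit.
        return ips_estimate(weights, costs)
    # log1p(x) = log(1 + x); here x = -lam * w * c >= 0.
    return float(-np.mean(np.log1p(-lam * weights * costs)) / lam)


# Hypothetical logged data: the third sample has a very large weight.
sample_weights = [1.0, 2.0, 50.0]
sample_costs = [-1.0, -0.5, -1.0]

ips_value = ips_estimate(sample_weights, sample_costs)
ls_value = ls_estimate(sample_weights, sample_costs, lam=0.1)
print("IPS:", ips_value)
print("LS (lam=0.1):", ls_value)
```

Because log(1 + x) <= x for x >= 0, the smoothed estimate under these assumptions is never below the IPS estimate. The sample with weight 50 dominates IPS but contributes only logarithmically to the smoothed value, which is the behavior the abstract describes.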
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Advanced Causal Inference Techniques
