Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning

Otmane Sakhi; Imad Aouali; Pierre Alquier; Nicolas Chopin

arXiv:2405.14335 · stat.ML · November 1, 2024

TL;DR

This paper introduces a logarithmically smoothed importance weighting estimator (LS) for offline contextual bandits. Its fully empirical concentration bounds are tighter than those of existing methods, which improves pessimistic policy evaluation, selection, and learning.

Contribution

The paper derives fully empirical concentration bounds for a broad class of importance weighting risk estimators. Seeking the tightest bound in that class leads to a new estimator (LS) that logarithmically smooths large importance weights.

Findings

01

The LS estimator yields tighter bounds than existing methods.

02

Empirical results show improved policy selection and learning.

03

The bounds apply to a broad class of importance weighting estimators, including most existing ones.

Abstract

This work investigates the offline formulation of the contextual bandit problem, where the goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies. Motivated by critical applications, we move beyond point estimators. Instead, we adopt the principle of pessimism where we construct upper bounds that assess a policy's worst-case performance, enabling us to confidently select and learn improved policies. Precisely, we introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators. These bounds are general enough to cover most existing estimators and pave the way for the development of new ones. In particular, our pursuit of the tightest bound within this class motivates a novel estimator (LS) that logarithmically smooths large importance weights. The…
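
To make the idea of logarithmically smoothing large importance weights concrete, here is a minimal sketch in standard-library Python (the official repository uses JAX). The abstract above is truncated before it gives a formula, so the smoothing form used here, the mean of -(1/lam) * log(1 - lam * w * c), is an assumption for illustration. The cost convention (costs in [-1, 0], lower is better), the parameter name `lam`, and the function names `ips_estimate` and `ls_estimate` are also hypothetical choices, not names taken from the paper or its code.

```python
import math


def ips_estimate(weights, costs):
    """Plain importance-weighted risk estimate: mean of w_i * c_i."""
    return sum(w * c for w, c in zip(weights, costs)) / len(costs)


def ls_estimate(weights, costs, lam):
    """Logarithmically smoothed risk estimate (illustrative form, assumed).

    Computes the mean of -(1/lam) * log(1 - lam * w_i * c_i).
    Assumes w_i >= 0 and c_i in [-1, 0], so the log argument is >= 1.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    for w, c in zip(weights, costs):
        if w < 0 or not (-1.0 <= c <= 0.0):
            raise ValueError("expected w >= 0 and c in [-1, 0]")
    # log1p(x) = log(1 + x); here x = -lam * w * c >= 0.
    total = sum(math.log1p(-lam * w * c) for w, c in zip(weights, costs))
    return -total / (lam * len(costs))


# Toy logged data: the third sample has a very large importance weight.
weights = [1.0, 2.0, 50.0]
costs = [-1.0, -0.5, -1.0]

ips = ips_estimate(weights, costs)
ls = ls_estimate(weights, costs, lam=0.1)
print(f"IPS estimate: {ips:.3f}")
print(f"LS estimate:  {ls:.3f}")
```

Because log(1 + x) <= x for x >= 0, this smoothed estimate is never lower than the plain importance-weighted one under the assumed cost convention, so a single huge weight cannot make a policy look arbitrarily good. As `lam` shrinks toward zero the two estimates coincide.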

Code & Models

Repositories

otmhi/offpolicy_ls (JAX, official)

Videos

Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning · SlidesLive

Taxonomy

Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Advanced Causal Inference Techniques