Anytime-valid off-policy inference for contextual bandits

Ian Waudby-Smith; Lili Wu; Aaditya Ramdas; Nikos Karampatziakis; Paul Mineiro

arXiv:2210.10768 · stat.ME · August 19, 2024 · 1 citation

Open Access 1 Repo

TL;DR

This paper introduces a flexible, anytime-valid framework for off-policy inference in contextual bandits, allowing accurate, adaptive evaluation of policies even during ongoing experiments with dependent data.

Contribution

It develops a comprehensive, martingale-based approach for off-policy evaluation that relaxes previous assumptions and works in real time with evolving data.

Findings

01

Provides confidence sequences for off-policy mean rewards

02

Derives confidence bands for reward distribution functions

03

Applicable to dependent, drifting contexts in real time

Abstract

Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts $X_{t}$ to actions $A_{t}$ in an attempt to maximize stochastic rewards $R_{t}$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data, a problem known as "off-policy evaluation" (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works, significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while…
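To make the off-policy evaluation problem concrete, here is a minimal sketch of the standard inverse-propensity-weighted (IPW) point estimate of a target policy's mean reward from logged (context, action, reward) data. It is a textbook baseline, not the paper's confidence-sequence construction, and it provides no anytime-valid guarantee. The names `ipw_value`, `simulate`, and `always_one`, the two-action setup, and the reward means 0.8 and 0.2 are all assumptions made for illustration.

```python
import random


def ipw_value(logs, target_prob):
    """IPW estimate of the target policy's mean reward.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy gave the action.
    target_prob(context, action): probability the target policy picks action.
    """
    total = 0.0
    for x, a, r, h in logs:
        weight = target_prob(x, a) / h  # importance weight
        total += weight * r
    return total / len(logs)


def simulate(n, seed=0):
    """Hypothetical logged data: uniform logging policy over two actions."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        x = rng.random()                      # context X_t (unused by rewards here)
        a = 1 if rng.random() < 0.5 else 0    # action A_t from the logging policy
        mean = 0.8 if a == 1 else 0.2         # assumed reward means per action
        r = 1.0 if rng.random() < mean else 0.0  # Bernoulli reward R_t
        logs.append((x, a, r, 0.5))
    return logs


def always_one(x, a):
    """Hypothetical target policy that always plays action 1."""
    return 1.0 if a == 1 else 0.0


logs = simulate(20000)
estimate = ipw_value(logs, always_one)
print(f"IPW estimate of target policy value: {estimate:.3f}")
```

In this simulated setup the target policy's true mean reward is 0.8 by construction, so the printed estimate should land close to that value. A single point estimate like this says nothing about uncertainty at intermediate sample sizes, which is the gap the confidence sequences described in the abstract are meant to address.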

Code & Models

Repositories

microsoft/csrobust

Taxonomy

Topics

Advanced Bandit Algorithms Research · Advanced Causal Inference Techniques · Mobile Crowdsensing and Crowdsourcing