Anytime-valid off-policy inference for contextual bandits
Ian Waudby-Smith, Lili Wu, Aaditya Ramdas, Nikos Karampatziakis, and Paul Mineiro

TL;DR
This paper introduces a flexible, anytime-valid framework for off-policy inference in contextual bandits, allowing accurate, adaptive evaluation of policies even during ongoing experiments with dependent data.
Contribution
It develops a comprehensive, martingale-based approach for off-policy evaluation that relaxes previous assumptions and works in real-time with evolving data.
Findings
Provides confidence sequences for off-policy mean rewards
Derives confidence bands for reward distribution functions
Applicable to dependent, drifting contexts in real-time
Abstract
Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts to actions in an attempt to maximize stochastic rewards. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data -- a problem known as "off-policy evaluation" (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works, significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while…
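To make the OPE problem concrete, here is a minimal sketch of a plain importance-weighted point estimate of a target policy's mean reward from logged data. It is not the paper's method: it yields no confidence sequence and has no anytime-valid guarantee. The names `ipw_value` and `target_prob`, the log tuple layout, and the toy simulation are all assumptions made for illustration.

```python
import random


def ipw_value(logs, target_prob):
    """Importance-weighted estimate of the target policy's mean reward.

    logs: iterable of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy assigned to the
          action it actually took (assumed to be > 0).
    target_prob: function (context, action) -> probability that the target
          policy would take that action in that context.
    """
    total = 0.0
    n = 0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy was to take the logged action.
        weight = target_prob(context, action) / logging_prob
        total += weight * reward
        n += 1
    return total / n


# Toy simulation (hypothetical): binary context, two actions,
# reward is 1 when the action matches the context, else 0.
rng = random.Random(0)
logs = []
for _ in range(20000):
    context = rng.randrange(2)
    action = rng.randrange(2)  # logging policy: uniform over both actions
    reward = 1.0 if action == context else 0.0
    logs.append((context, action, reward, 0.5))


def match_policy(context, action):
    # Target policy: always pick the action equal to the context.
    return 1.0 if action == context else 0.0


estimate = ipw_value(logs, match_policy)
print(f"IPW estimate of target policy value: {estimate:.3f}")
```

In this toy setup the target policy's true mean reward is 1, so the estimate should land close to 1 even though the logging policy earns only about 0.5 on average.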
Taxonomy
Topics: Advanced Bandit Algorithms Research · Advanced Causal Inference Techniques · Mobile Crowdsensing and Crowdsourcing
