PAC-Bayesian Offline Contextual Bandits With Guarantees
Otmane Sakhi, Pierre Alquier, Nicolas Chopin

TL;DR
This paper presents a PAC-Bayesian framework for offline contextual bandit learning. It yields tighter generalization bounds, guarantees improvement over the logging policy, and requires no additional hyperparameter tuning on held-out sets. Extensive experiments support these claims.
Contribution
The paper introduces a PAC-Bayesian approach to off-policy learning in contextual bandits, with tighter generalization bounds and policy-improvement guarantees. Unlike previous work, it does not derive its learning principles from intractable or loose bounds.
Findings
Generalization bounds that are proved tighter than competing bounds
Tractable algorithms that optimize these bounds directly to improve on the logging policy offline
Performance guarantees that hold in practical scenarios, as shown in extensive experiments
Abstract
This paper introduces a new principled approach for off-policy learning in contextual bandits. Unlike previous work, our approach does not derive learning principles from intractable or loose bounds. We analyse the problem through the PAC-Bayesian lens, interpreting policies as mixtures of decision rules. This allows us to propose novel generalization bounds and provide tractable algorithms to optimize them. We prove that the derived bounds are tighter than their competitors, and can be optimized directly to confidently improve upon the logging policy offline. Our approach learns policies with guarantees, uses all available data and does not require tuning additional hyperparameters on held-out sets. We demonstrate through extensive experiments the effectiveness of our approach in providing performance guarantees in practical scenarios.
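To make "policies as mixtures of decision rules" concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes linear argmax decision rules, a Gaussian distribution over their parameters, costs in [-1, 0], and a clipped importance-weighted risk estimate; none of these choices is specified in this summary. The names `mixture_policy_probs`, `clipped_ips_risk`, and `tau` are hypothetical.

```python
import numpy as np


def mixture_policy_probs(contexts, mean, scale, n_actions, n_samples=200, rng=None):
    """Monte Carlo estimate of pi(a | x) for a mixture of decision rules.

    Assumption: each decision rule is x -> argmax_a (x @ theta)[a], and theta is
    drawn from an isotropic Gaussian N(mean, scale^2 I) with shape (d, n_actions).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = contexts.shape[0]
    counts = np.zeros((n, n_actions))
    for _ in range(n_samples):
        # Sample one deterministic decision rule from the mixture.
        theta = mean + scale * rng.standard_normal(mean.shape)
        chosen = np.argmax(contexts @ theta, axis=1)
        counts[np.arange(n), chosen] += 1.0
    # Fraction of sampled rules that pick each action approximates pi(a | x).
    return counts / n_samples


def clipped_ips_risk(probs, actions, costs, logging_propensities, tau=0.1):
    """Clipped importance-weighted estimate of a policy's risk from logged data.

    Assumption: the logging propensity is clipped from below at tau, which
    trades some bias for lower variance.
    """
    pi_a = probs[np.arange(len(actions)), actions]
    weights = pi_a / np.maximum(logging_propensities, tau)
    return float(np.mean(weights * costs))
```

With logged tuples of context, action, cost, and logging propensity, one would call `mixture_policy_probs` on the contexts and pass the result to `clipped_ips_risk`. A PAC-Bayesian bound would then add a complexity term, typically a KL divergence between the learned distribution over parameters and a reference distribution, to an empirical estimate of this kind. The exact form of that term is not given in this summary and is left out of the sketch.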
Taxonomy
Topics: Machine Learning and Algorithms · Advanced Bandit Algorithms Research · Machine Learning and Data Classification
