Refined PAC-Bayes Bounds for Offline Bandits
Amaury Gouverneur, Tobias J. Oechtering, and Mikael Skoglund

TL;DR
This paper refines PAC-Bayesian bounds on empirical reward estimates for off-policy learning in bandit problems. A new discretization-based technique optimizes the "in probability" parameter of the bounds, yielding nearly optimal, parameter-free guarantees.
Contribution
It develops two parameter-free PAC-Bayes bounds for offline bandits using a novel parameter optimization method based on discretization.
Findings
The bounds are nearly optimal: they recover the same rate as would be obtained by setting the "in probability" parameter after the data are realized.
Two bounds are provided: one based on Hoeffding-Azuma, another on Bernstein.
The bounds improve upon previous results by Seldin et al. (2010).
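To see why a Bernstein-type bound can be preferable to a Hoeffding-type one in this setting, the sketch below compares the two deviation widths for an importance-weighted reward. It is a simplified i.i.d. illustration, not the martingale bounds of the paper: the function names, the sampling probability `p`, and the sample size `n` are hypothetical. With rewards in [0, 1] and an arm played with probability `p`, the importance-weighted reward lies in [0, 1/p], while its variance is at most 1/p.

```python
import math

def hoeffding_width(n, value_range, delta):
    """One-sided Hoeffding deviation for the mean of n i.i.d. bounded variables."""
    return value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def bernstein_width(n, variance, max_dev, delta):
    """One-sided Bernstein deviation: sqrt(2 * var * L / n) + 2 * M * L / (3 * n),
    where L = log(1 / delta) and M bounds the deviation from the mean."""
    log_term = math.log(1.0 / delta)
    return (math.sqrt(2.0 * variance * log_term / n)
            + 2.0 * max_dev * log_term / (3.0 * n))

# Hypothetical values: arm played with probability p, n rounds, confidence 1 - delta.
p, n, delta = 0.05, 10_000, 0.05
h_width = hoeffding_width(n, value_range=1.0 / p, delta=delta)
b_width = bernstein_width(n, variance=1.0 / p, max_dev=1.0 / p, delta=delta)
print(f"Hoeffding width: {h_width:.4f}")
print(f"Bernstein width: {b_width:.4f}")
```

The Hoeffding width scales with the range 1/p, whereas the leading Bernstein term scales with the square root of the variance, which is only of order 1/sqrt(p) here.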
Abstract
In this paper, we present refined probabilistic bounds on empirical reward estimates for off-policy learning in bandit problems. We build on the PAC-Bayesian bounds from Seldin et al. (2010) and improve on their results using a new parameter optimization approach introduced by Rodríguez et al. (2024). This technique is based on a discretization of the space of possible events to optimize the "in probability" parameter. We provide two parameter-free PAC-Bayes bounds, one based on Hoeffding-Azuma's inequality and the other based on Bernstein's inequality. We prove that our bounds are almost optimal as they recover the same rate as would be obtained by setting the "in probability" parameter after the realization of the data.
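The phrase "setting the parameter after the realization of the data" can be made concrete with a generic trade-off of the form (KL + log(1/delta)) / lam + lam * scale, which is minimized at a value of lam that depends on the data-dependent KL term. The sketch below is an assumption-laden illustration and not the paper's method: it uses the classical union bound over a parameter grid as a baseline, whereas the paper discretizes the space of events instead. All names (`bound`, `oracle_bound`, `grid_bound`), the grid, and the numeric values are hypothetical.

```python
import math

def bound(lam, complexity, scale):
    """Generic trade-off: complexity / lam + lam * scale."""
    return complexity / lam + lam * scale

def oracle_bound(kl, delta, scale):
    """Value reached if lam could be tuned after seeing the data.
    This is a reference rate only, not a valid high-probability bound."""
    return 2.0 * math.sqrt((kl + math.log(1.0 / delta)) * scale)

def grid_bound(kl, delta, scale, grid):
    """Holds for every lam in the grid simultaneously: a union bound
    gives each grid point confidence delta / len(grid)."""
    complexity = kl + math.log(len(grid) / delta)
    return min(bound(lam, complexity, scale) for lam in grid)

grid = [2.0 ** k for k in range(-10, 11)]  # hypothetical geometric grid
results = []
for kl in (0.1, 1.0, 10.0):
    oracle = oracle_bound(kl, delta=0.05, scale=0.01)
    gridded = grid_bound(kl, delta=0.05, scale=0.01, grid=grid)
    results.append((kl, oracle, gridded))
    print(f"KL={kl:5.1f}  oracle={oracle:.4f}  grid={gridded:.4f}")
```

The gap between the two columns is the price paid for not knowing the best parameter in advance; the abstract's optimality claim is that the refined bounds recover the oracle rate.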
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Distributed Sensor Networks and Detection Algorithms
