Kullback-Leibler upper confidence bounds for optimal sequential allocation
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz

TL;DR
This paper introduces KL-UCB index policies for sequential allocation in the stochastic multi-armed bandit problem. It proves finite-time regret bounds that asymptotically match the known lower bounds and reports improvements over existing methods.
Contribution
It presents a unified finite-time regret analysis of two KL-UCB algorithms: kl-UCB for one-parameter exponential families and empirical KL-UCB for bounded, finitely supported distributions. The analysis establishes their asymptotic optimality.
Findings
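Finite-time regret bounds asymptotically match the lower bounds of Lai and Robbins (exponential families) and Burnetas and Katehakis (bounded, finitely supported distributions).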
Algorithms outperform existing methods on bounded reward distributions.
A single analysis covers both one-parameter exponential families and bounded, finitely supported distributions.
Abstract
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins [J. R. Stat. Soc. Ser. B Stat. Methodol. 41 (1979) 148-177], based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: the kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins [Adv. in Appl. Math. 6 (1985) 4-22] and Burnetas and Katehakis [Adv. in Appl. Math. 17 (1996) 122-142], respectively. We also investigate the behavior of these…
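To make the idea of a Kullback-Leibler upper confidence bound concrete, here is a minimal sketch for Bernoulli rewards, one example of a one-parameter exponential family. It is an illustration under stated assumptions, not the authors' implementation: the exploration budget `log(t)`, the bisection tolerance, and the function names `bernoulli_kl` and `kl_ucb_index` are all choices made for this example. The index of an arm is taken to be the largest mean `q` whose divergence from the empirical mean, scaled by the number of pulls, stays within the budget.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, t, tol=1e-6):
    """Largest q in [mean, 1] with pulls * kl(mean, q) <= log(t).

    Assumption: the exploration budget is log(t); other choices are possible.
    """
    if pulls == 0:
        return 1.0  # an unexplored arm gets the most optimistic index
    budget = math.log(max(t, 1)) / pulls
    lo, hi = mean, 1.0
    # kl(mean, q) is increasing in q for q >= mean, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def choose_arm(means, pulls, t):
    """Index policy: play the arm with the highest upper confidence bound."""
    indices = [kl_ucb_index(m, n, t) for m, n in zip(means, pulls)]
    return max(range(len(indices)), key=indices.__getitem__)
```

For a fixed empirical mean, the index shrinks toward that mean as the arm is pulled more often, so a rarely played arm can still be selected over an arm with a higher empirical mean.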
