A Note on KL-UCB+ Policy for the Stochastic Bandit
Junya Honda

TL;DR
This note proves the asymptotic optimality of the KL-UCB+ policy for stochastic bandits, using the same techniques as analyses of other known policies. KL-UCB+ was already known to outperform KL-UCB empirically, but its regret bound had been unknown.
Contribution
It provides a simple proof of the asymptotic optimality of the KL-UCB+ policy, settling the previously unknown regret bound for the policy's original form.
Findings
KL-UCB+ empirically outperforms KL-UCB
Asymptotic optimality of KL-UCB+ established
Proof uses techniques similar to other bandit policies
Abstract
This note considers the classic stochastic K-armed bandit problem. In this setting, the KL-UCB policy is known to achieve the asymptotically optimal regret bound, and the KL-UCB+ policy empirically performs better than KL-UCB, although the regret bound for the original form of KL-UCB+ has been unknown. This note shows that a simple proof of the asymptotic optimality of the KL-UCB+ policy can be given using the same techniques as those used to analyze other known policies.
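The abstract does not spell out the two indices, so the following is only a minimal sketch of how they are commonly stated, under several assumptions: rewards are Bernoulli, the optional log-log exploration term is dropped, KL-UCB uses an exploration budget of log t, and KL-UCB+ replaces it with log(t / N), where N is the number of pulls of the arm. The function names (`bernoulli_kl`, `kl_index`, `kl_ucb_index`, `kl_ucb_plus_index`) are hypothetical and chosen for illustration.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_index(mean, pulls, budget, tol=1e-9):
    """Largest q in [mean, 1] with pulls * KL(mean, q) <= budget, found by bisection."""
    if budget <= 0:
        return mean  # no exploration bonus left
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def kl_ucb_index(mean, pulls, t):
    """Assumed KL-UCB index: exploration budget log t."""
    return kl_index(mean, pulls, math.log(t))


def kl_ucb_plus_index(mean, pulls, t):
    """Assumed KL-UCB+ index: exploration budget log(t / pulls)."""
    return kl_index(mean, pulls, math.log(t / pulls))


if __name__ == "__main__":
    # An arm with empirical mean 0.5 after 10 pulls, at round t = 100.
    print(kl_ucb_index(0.5, 10, 100), kl_ucb_plus_index(0.5, 10, 100))
```

Under these assumptions the KL-UCB+ budget shrinks as an arm is pulled more often, so its index is never larger than the KL-UCB index for the same arm, which is one plausible reading of why it explores less aggressively in practice.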
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Optimization and Search Problems
