Optimally Confident UCB: Improved Regret for Finite-Armed Bandits
Tor Lattimore

TL;DR
This paper introduces a UCB-based algorithm for stochastic finite-armed bandits that achieves order-optimal regret in both the problem-dependent and the worst-case sense, while remaining simple and computationally efficient.
Contribution
The paper proposes the first algorithm that simultaneously attains order-optimal problem-dependent and worst-case regret for stochastic finite-armed bandits.
Findings
The algorithm is simple and efficient.
It empirically outperforms existing methods.
Theoretical analysis confirms optimal regret bounds.
Abstract
I present the first algorithm for stochastic finite-armed bandits that simultaneously enjoys order-optimal problem-dependent regret and worst-case regret. Besides the theoretical results, the new algorithm is simple, efficient and empirically superb. The approach is based on UCB, but with a carefully chosen confidence parameter that optimally balances the risk of failing confidence intervals against the cost of excessive optimism.
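To make the role of the confidence parameter concrete, here is a minimal sketch of a generic UCB loop on Bernoulli arms. It is not the paper's algorithm: the function name `run_ucb`, the parameters `alpha`, `log_term` and `seed`, and the default bonus (a UCB1-style `sqrt(alpha * log(t) / pulls)`) are all assumptions chosen for illustration. The `log_term` hook marks where a different, more carefully tuned confidence level would be plugged in.

```python
import math
import random


def run_ucb(means, horizon, alpha=2.0, log_term=None, seed=0):
    """Play a Bernoulli bandit with a generic UCB index and return pull counts.

    Hypothetical illustration only: `alpha` and `log_term` stand in for the
    confidence parameter; the default is a UCB1-style log(t) term.
    """
    rng = random.Random(seed)
    k = len(means)
    if log_term is None:
        # Assumed default: confidence grows with the current round t.
        log_term = lambda t, n: math.log(t)

    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # total reward per arm

    for t in range(1, horizon + 1):
        if t <= k:
            # Pull each arm once so every empirical mean is defined.
            arm = t - 1
        else:
            def index(i):
                # Empirical mean plus an optimism bonus that shrinks
                # as the arm is pulled more often.
                bonus = math.sqrt(alpha * max(log_term(t, horizon), 0.0) / counts[i])
                return sums[i] / counts[i] + bonus

            arm = max(range(k), key=index)

        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    return counts


if __name__ == "__main__":
    # Two arms with well-separated means; most pulls should go to the second.
    print(run_ucb([0.1, 0.9], horizon=2000))
```

A larger bonus makes confidence intervals fail less often but spends more pulls on suboptimal arms, while a smaller bonus does the reverse. That is the trade-off the abstract describes between failing confidence intervals and excessive optimism.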
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Reinforcement Learning in Robotics
