Almost Boltzmann Exploration
Harsh Gupta, Seo Taek Kong, R. Srikant, Weina Wang

TL;DR
This paper introduces a modified Boltzmann exploration algorithm that achieves improved regret bounds in stochastic multi-armed bandit problems, including those with graph-structured feedback, and demonstrates strong empirical performance.
Contribution
A simple modification to Boltzmann exploration based on a variation of the doubling trick achieves better regret bounds and handles graph-structured feedback without prior knowledge.
Findings
Achieves $O(K \log^{1+\beta} T)$ regret in stochastic MABs.
Performs as well as or better than state-of-the-art algorithms in experiments.
Effective in both traditional and graph-structured feedback settings.
Abstract
Boltzmann exploration is widely used in reinforcement learning to provide a trade-off between exploration and exploitation. Recently, Cesa-Bianchi et al. (2017) showed that pure Boltzmann exploration does not perform well from a regret perspective, even in the simplest setting of stochastic multi-armed bandit (MAB) problems. In this paper, we show that a simple modification to Boltzmann exploration, motivated by a variation of the standard doubling trick, achieves $O(K \log^{1+\beta} T)$ regret for a stochastic MAB problem with $K$ arms, where $\beta$ is a parameter of the algorithm. This improves on the result in (Cesa-Bianchi et al., 2017), where an algorithm inspired by the Gumbel-softmax trick achieves regret. We also show that our algorithm achieves regret in stochastic MAB problems with graph-structured feedback,…
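For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of plain Boltzmann (softmax) exploration on a Bernoulli bandit. It is not the modified algorithm proposed in the paper, whose details are not given in this summary. The function names, the Bernoulli reward model, and the learning-rate schedule `eta_fn` are all assumptions made for illustration.

```python
import math
import random


def boltzmann_probs(means, eta):
    """Softmax distribution over arms from empirical means and learning rate eta."""
    m = max(means)  # subtract the max for numerical stability
    weights = [math.exp(eta * (x - m)) for x in means]
    total = sum(weights)
    return [w / total for w in weights]


def run_boltzmann(true_means, horizon, eta_fn, seed=0):
    """Run plain Boltzmann exploration (illustrative baseline only).

    true_means: Bernoulli success probabilities (assumed reward model).
    eta_fn: hypothetical schedule mapping round t to a learning rate.
    Returns the cumulative pseudo-regret and the per-arm pull counts.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(true_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            estimates = [sums[i] / counts[i] for i in range(k)]
            probs = boltzmann_probs(estimates, eta_fn(t))
            arm = rng.choices(range(k), weights=probs)[0]
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - true_means[arm]
    return regret, counts
```

A call such as `run_boltzmann([0.3, 0.7], 1000, lambda t: math.log(t))` runs the baseline with a logarithmically growing learning rate, which is one common choice and not one taken from the paper.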
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Optimization and Search Problems
