An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits
Isaac J. Sledge, Jose C. Principe

TL;DR
This paper introduces an information-theoretic exploration strategy for stochastic multi-armed bandits that optimally balances exploration and exploitation, achieving logarithmic regret through a simulated-annealing-like parameter update.
Contribution
Proposes an exploration strategy, based on the value of information criterion, that achieves optimal regret in stochastic, discrete multi-armed bandits.
Findings
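Achieves regret that is logarithmic in the number of episodes when the information parameter is annealed with a sufficiently fast cooling schedule.
Trades off exploration against exploitation through a single information parameter that can be varied during search.
Supports the regret claim through theoretical analysis.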
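For readers unfamiliar with the term, "logarithmic regret" is usually stated as below. This is the standard textbook definition of expected regret, not a formula quoted from the paper; the symbols T (number of episodes), \mu_k (mean reward of arm k), \mu^{*} (best mean), and a_t (arm pulled in episode t) are notation assumed here.

```latex
R_T \;=\; T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right],
\qquad \mu^{*} = \max_{k} \mu_k,
\qquad R_T = O(\log T).
```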
Abstract
In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes.
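The abstract does not give the update rule, so the following is only a rough sketch of the general idea: a soft-max (Boltzmann) policy over estimated arm means whose temperature is cooled as episodes pass. Everything in it is an assumption made for illustration, including the Bernoulli rewards, the names `softmax_probs` and `run_bandit`, and the `1 / log(t + 1)` schedule. The temperature `tau` merely stands in for the paper's information parameter, and no claim is made that this particular schedule attains the regret bound from the paper.

```python
import math
import random


def softmax_probs(means, tau):
    """Boltzmann weights over estimated arm means at temperature tau."""
    peak = max(means)  # subtract the max for numerical stability
    weights = [math.exp((m - peak) / tau) for m in means]
    total = sum(weights)
    return [w / total for w in weights]


def run_bandit(true_means, episodes, seed=0):
    """Play Bernoulli arms with an annealed soft-max policy.

    Returns the accumulated pseudo-regret and the per-arm pull counts.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    best = max(true_means)
    regret = 0.0
    for t in range(1, episodes + 1):
        # Hypothetical cooling schedule (an assumption, not the paper's):
        # high temperature early (near-uniform play), low temperature later.
        tau = max(1.0 / math.log(t + 1.0), 1e-3)
        probs = softmax_probs(estimates, tau)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        regret += best - true_means[arm]
    return regret, counts


if __name__ == "__main__":
    total_regret, pulls = run_bandit([0.2, 0.5, 0.8], episodes=2000)
    print("pseudo-regret:", total_regret, "pulls per arm:", pulls)
```

A high temperature spreads probability almost uniformly across arms (exploration), while a low temperature concentrates it on the arm with the best current estimate (exploitation); the schedule controls how quickly play shifts from one regime to the other.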
