Multi-Armed Bandits with Self-Information Rewards
Nir Weinberger, Michal Yemini

TL;DR
This paper introduces a multi-armed bandit model in which the reward for each round is the self-information of the observed symbol, proposes UCB-based algorithms with regret guarantees, and analyzes their asymptotic behavior.
Contribution
The paper develops UCB-based algorithms for the informational multi-armed bandit model, addressing bias correction and unknown alphabet size, with theoretical regret bounds and asymptotic analysis.
Findings
Algorithms achieve sublinear regret bounds.
In the Bernoulli case, the asymptotic behavior of the algorithms is compared with the Lai-Robbins lower bound.
Numerical results confirm theoretical performance guarantees.
Abstract
This paper introduces the informational multi-armed bandit (IMAB) model in which at each round, a player chooses an arm, observes a symbol, and receives an unobserved reward in the form of the symbol's self-information. Thus, the expected reward of an arm is the Shannon entropy of the probability mass function of the source that generates its symbols. The player aims to maximize the expected total reward associated with the entropy values of the arms played. Under the assumption that the alphabet size is known, two UCB-based algorithms are proposed for the IMAB model which consider the biases of the plug-in entropy estimator. The first algorithm optimistically corrects the bias term in the entropy estimation. The second algorithm relies on data-dependent confidence intervals that adapt to sources with small entropy values. Performance guarantees are provided by upper bounding the…
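The model described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the index below combines the plug-in entropy estimate with a first-order (Miller-Madow style) bias term and a generic UCB1-style exploration bonus, both chosen here only as stand-ins. The function names, the bonus form, and the example sources are assumptions made for illustration, and the alphabet size of each arm is taken as known.

```python
import math
import random
from collections import Counter


def entropy(pmf):
    """Shannon entropy in nats: the expected self-information -log p(X)."""
    return -sum(p * math.log(p) for p in pmf if p > 0)


def plug_in_entropy(counts):
    """Entropy of the empirical distribution of the observed symbols."""
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values() if c > 0)


def ucb_index(counts, n_pulls, alphabet_size, t):
    """Hypothetical index: plug-in estimate + bias term + exploration bonus."""
    # The plug-in estimator underestimates entropy by roughly
    # (alphabet_size - 1) / (2 n), so that amount is added back.
    bias = (alphabet_size - 1) / (2 * n_pulls)
    # Generic UCB1-style bonus, an assumption, not the paper's interval.
    bonus = math.sqrt(2 * math.log(t) / n_pulls)
    return plug_in_entropy(counts) + bias + bonus


def run_ucb(pmfs, horizon, seed=0):
    """Play `horizon` rounds and return how often each arm was chosen."""
    rng = random.Random(seed)
    k = len(pmfs)
    counts = [Counter() for _ in range(k)]
    pulls = [0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play every arm once to initialize
        else:
            arm = max(
                range(k),
                key=lambda a: ucb_index(counts[a], pulls[a], len(pmfs[a]), t),
            )
        # The player observes only the symbol; the reward -log p(symbol)
        # is never revealed because the pmf is unknown to the player.
        symbol = rng.choices(range(len(pmfs[arm])), weights=pmfs[arm])[0]
        counts[arm][symbol] += 1
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    # Two hypothetical sources: a nearly deterministic one and a uniform one.
    sources = [[0.99, 0.01], [0.25, 0.25, 0.25, 0.25]]
    print([round(entropy(p), 3) for p in sources])
    print(run_ucb(sources, horizon=2000))
```

In this toy setup the uniform source has the larger entropy, so a reasonable index policy should end up playing it in most rounds, while the low-entropy arm is sampled only often enough to rule it out.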
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
