Infomax strategies for an optimal balance between exploration and exploitation
Gautam Reddy, Antonio Celani, Massimo Vergassola

TL;DR
This paper shows that an Infomax strategy, Info-p, balances exploration and exploitation in multi-armed bandit problems: by gathering information on the highest mean reward among the arms, it saturates known optimal bounds.
Contribution
The study introduces an Infomax-based policy, Info-p, and shows that it saturates known optimal bounds in multi-armed bandit problems and compares favorably to existing policies.
Findings
Info-p saturates known optimal bounds
Info-p compares favorably to existing policies
Focus on highest mean reward enables optimal tradeoffs
Abstract
Proper balance between exploitation and exploration is what makes good decisions, which achieve high rewards like payoff or evolutionary fitness. The Infomax principle postulates that maximization of information directs the function of diverse systems, from living systems to artificial neural networks. While specific applications are successful, the validity of information as a proxy for reward remains unclear. Here, we consider the multi-armed bandit decision problem, which features arms (slot-machines) of unknown probabilities of success and a player trying to maximize cumulative payoff by choosing the sequence of arms to play. We show that an Infomax strategy (Info-p) which optimally gathers information on the highest mean reward among the arms saturates known optimal bounds and compares favorably to existing policies. The highest mean reward considered by Info-p is not the quantity…
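To make the setting in the abstract concrete, here is a minimal sketch of a Bernoulli multi-armed bandit. It is not the paper's Info-p policy, whose details are not given in this summary. The policy shown is Thompson sampling, a well-known existing policy used here only as a stand-in. The function names, the uniform Beta(1, 1) prior, and the Monte Carlo sampling of the highest mean reward are all assumptions made for illustration.

```python
import random


def pull(p, rng):
    """Play an arm with success probability p; return 1 (win) or 0 (loss)."""
    return 1 if rng.random() < p else 0


def thompson(wins, losses, rng):
    """Stand-in policy (NOT Info-p): draw one sample per arm from its
    Beta posterior and play the arm with the largest draw."""
    draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
    return draws.index(max(draws))


def sample_max_mean(wins, losses, rng, n_samples=2000):
    """Monte Carlo samples of the highest mean reward among the arms,
    under independent Beta(1 + wins, 1 + losses) posteriors.
    The uniform prior is an assumption of this sketch."""
    return [
        max(rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses))
        for _ in range(n_samples)
    ]


def run_bandit(probs, horizon, choose_arm, seed=0):
    """Play `horizon` rounds; return cumulative payoff and per-arm counts."""
    rng = random.Random(seed)
    wins = [0] * len(probs)
    losses = [0] * len(probs)
    total = 0
    for _ in range(horizon):
        arm = choose_arm(wins, losses, rng)
        reward = pull(probs[arm], rng)
        total += reward
        if reward:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return total, wins, losses


if __name__ == "__main__":
    probs = [0.2, 0.8]  # hypothetical success probabilities, unknown to the player
    horizon = 2000
    total, wins, losses = run_bandit(probs, horizon, thompson)
    # Expected regret: shortfall relative to always playing the best arm.
    plays = [w + l for w, l in zip(wins, losses)]
    regret = sum(n * (max(probs) - p) for n, p in zip(plays, probs))
    samples = sample_max_mean(wins, losses, random.Random(1))
    print("payoff:", total, "plays per arm:", plays, "expected regret:", regret)
    print("posterior mean of highest mean reward:", sum(samples) / len(samples))
```

The posterior over the highest mean reward sampled by `sample_max_mean` is the quantity the abstract says Info-p gathers information on. An Infomax policy would pick arms to reduce uncertainty about it, which this sketch does not attempt.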
