Can Q-learning solve Multi Armed Bantids?
Refael Vivanti

TL;DR
This paper investigates the limitations of current reinforcement learning algorithms in solving Multi-Armed Bandit problems, identifies key variance-related issues, and proposes a novel variance-equalizing method called ASRN to improve performance.
Contribution
The paper reveals why existing RL algorithms struggle with MAB problems and introduces ASRN, a new variance normalization technique that enhances their effectiveness.
Findings
RL algorithms fail to solve a basic MAB problem in some situations and struggle in many common ones.
Variance differences cause exploration and estimation issues.
ASRN significantly improves RL performance on MAB tasks.
Abstract
When a reinforcement learning (RL) method has to decide between several optional policies by solely looking at the received reward, it has to implicitly optimize a Multi-Armed-Bandit (MAB) problem. This raises the question: are current RL algorithms capable of solving MAB problems? We claim that the surprising answer is no. In our experiments we show that in some situations they fail to solve a basic MAB problem, and in many common situations they have a hard time: they suffer from regression in results during training, sensitivity to initialization and high sample complexity. We claim that this stems from variance differences between policies, which cause two problems: The first problem is the "Boring Policy Trap", where each policy has a different implicit exploration that depends on its reward variance, and leaving a boring, or low-variance, policy is less likely due to its low implicit…
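The "Boring Policy Trap" described in the abstract can be illustrated with a small simulation. The following is a minimal sketch and not the paper's experiment: the two arms, their reward parameters, the learning rate, and the purely greedy action choice are all assumptions picked for illustration. A stateless Q-learner faces a zero-variance "boring" arm with mean reward 0.0 and a noisy arm with the higher mean reward 0.1. Noise makes the estimate of the noisy arm fluctuate, and once it dips below the boring arm's estimate the greedy learner switches and never gets a reason to switch back.

```python
import random


def run_greedy_q(steps=1000, alpha=0.1, seed=0):
    """Run stateless greedy Q-learning on a hypothetical two-armed bandit.

    Returns the index of the arm the learner prefers at the end:
    0 is the boring (zero-variance) arm, 1 is the noisy arm.
    """
    rng = random.Random(seed)
    # (mean, standard deviation) per arm; illustrative values only.
    arms = [(0.0, 0.0), (0.1, 1.0)]
    q = [0.0, 0.0]  # both estimates start equal
    for _ in range(steps):
        # Purely greedy choice; ties go to the noisy arm so it is tried first.
        a = 1 if q[1] >= q[0] else 0
        mean, std = arms[a]
        reward = rng.gauss(mean, std)
        # Standard incremental update with no next state (bandit setting).
        q[a] += alpha * (reward - q[a])
    return 1 if q[1] >= q[0] else 0


def trapped_fraction(runs=200):
    """Fraction of independent runs that end up preferring the boring arm."""
    trapped = sum(run_greedy_q(seed=s) == 0 for s in range(runs))
    return trapped / runs


if __name__ == "__main__":
    print("fraction of runs stuck on the boring arm:", trapped_fraction())
```

In this setup the boring arm's estimate never moves, so nothing ever pushes the learner off it, while the better arm is abandoned after a short unlucky streak. Explicit exploration such as epsilon-greedy would change the picture; the sketch leaves it out on purpose to isolate the implicit exploration that comes from reward variance.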
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control
Methods: Convolution · Q-Learning · Dense Connections · Deep Q-Network
