Can Q-learning solve Multi Armed Bantids?

Refael Vivanti

arXiv:2110.10934 · cs.LG · October 22, 2021

TL;DR

This paper investigates the limitations of current reinforcement learning algorithms in solving Multi-Armed Bandit problems, identifies key variance-related issues, and proposes a novel variance-equalizing method called ASRN to improve performance.

Contribution

The paper reveals why existing RL algorithms struggle with MAB problems and introduces ASRN, a new variance normalization technique that enhances their effectiveness.

Findings

01. RL algorithms fail to solve a basic MAB problem in some situations and struggle in many common ones.

02. Variance differences between policies cause exploration and estimation issues.

03. ASRN significantly improves RL performance on MAB tasks.

Abstract

When a reinforcement learning (RL) method has to decide between several optional policies by looking solely at the received reward, it has to implicitly optimize a Multi-Armed-Bandit (MAB) problem. This raises the question: are current RL algorithms capable of solving MAB problems? We claim that the surprising answer is no. In our experiments we show that in some situations they fail to solve a basic MAB problem, and in many common situations they have a hard time: they suffer from regression in results during training, sensitivity to initialization, and high sample complexity. We claim that this stems from variance differences between policies, which cause two problems. The first problem is the "Boring Policy Trap", where each policy has a different implicit exploration that depends on its reward variance, and leaving a boring, or low-variance, policy is less likely due to its low implicit…
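To make the setting in the abstract concrete, the following is a minimal sketch of tabular Q-learning on a single-state bandit whose arms differ in reward variance. It is not the paper's experimental setup: the function name `run_bandit_q_learning`, the Gaussian reward model, the epsilon-greedy exploration, and all parameter values are assumptions chosen for illustration only.

```python
import random


def run_bandit_q_learning(means, stds, steps=5000, alpha=0.1, epsilon=0.1, seed=0):
    """Stateless Q-learning on a Gaussian multi-armed bandit (illustrative sketch).

    means, stds: per-arm reward mean and standard deviation (assumed Gaussian).
    Returns the final Q-value estimates and the number of pulls per arm.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    q = [0.0] * n_arms      # one Q-value per arm; there is only a single state
    pulls = [0] * n_arms

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explicit exploration: pick a uniformly random arm.
            arm = rng.randrange(n_arms)
        else:
            # Greedy choice, breaking ties between equal Q-values at random.
            best = max(q)
            arm = rng.choice([a for a in range(n_arms) if q[a] == best])

        reward = rng.gauss(means[arm], stds[arm])
        # With one state and no successor state, the Q-learning update
        # reduces to an exponential moving average of the arm's rewards.
        q[arm] += alpha * (reward - q[arm])
        pulls[arm] += 1

    return q, pulls


if __name__ == "__main__":
    # Hypothetical arms: arm 0 is "boring" (low variance, lower mean),
    # arm 1 is noisy (high variance, higher mean).
    for seed in range(5):
        q, pulls = run_bandit_q_learning(
            means=[0.5, 1.0], stds=[0.01, 5.0], seed=seed
        )
        print(f"seed={seed} Q={[round(v, 2) for v in q]} pulls={pulls}")
```

Running the script over several seeds and varying the two standard deviations gives one way to observe how often the greedy choice settles on each arm. Because a noisy arm's estimate fluctuates more, it crosses the other arm's estimate more often, which is the kind of variance-dependent implicit exploration the abstract describes.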

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control

Methods: Convolution · Q-Learning · Dense Connections · Deep Q-Network