Faster Q-Learning Algorithms for Restless Bandits
Parvish Kakarapalli, Devendra Kayande, Rahul Meshram

TL;DR
This paper introduces faster Q-learning algorithms, including variants and exploration policies, for restless multi-armed bandits, demonstrating improved convergence rates through numerical experiments.
Contribution
It studies Q-learning and its variants (SQL, GSQL, and PhaseQL) with UCB exploration, and applies them to Whittle index learning for RMABs.
Findings
Q-learning with UCB converges faster than with ε-greedy.
PhaseQL with UCB achieves the fastest convergence among tested algorithms.
Numerical examples validate the improved convergence rates of proposed methods.
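To make the ε-greedy versus UCB comparison concrete, here is a minimal sketch of tabular Q-learning with both exploration rules. It is not the paper's code: the function name `q_learning`, the toy two-state MDP (`P`, `R`), the bonus constant `c`, and the `1/N(s,a)` step size are all assumptions chosen for illustration.

```python
import numpy as np

def q_learning(P, R, gamma=0.9, steps=5000, policy="ucb", eps=0.1, c=2.0, seed=0):
    """Tabular Q-learning along a single trajectory (illustrative sketch).

    P[a, s, s'] are transition probabilities, R[s, a] are rewards.
    """
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))  # visit counts per state-action pair
    s = 0
    for t in range(1, steps + 1):
        if policy == "ucb":
            # Exploration bonus shrinks as a pair is visited more often;
            # untried actions get an infinite bonus so each is tried once.
            counts = N[s]
            bonus = np.where(
                counts > 0,
                c * np.sqrt(np.log(t + 1) / np.maximum(counts, 1)),
                np.inf,
            )
            a = int(np.argmax(Q[s] + bonus))
        else:
            # epsilon-greedy: random action with probability eps, else greedy.
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
        N[s, a] += 1
        alpha = 1.0 / N[s, a]  # assumed decaying step size
        s_next = int(rng.choice(n_states, p=P[a, s]))
        target = R[s, a] + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    return Q, N

# Hypothetical 2-state, 2-action MDP used only to exercise the sketch.
P = np.array([[[0.7, 0.3], [0.2, 0.8]],   # transitions under action 0
              [[0.4, 0.6], [0.9, 0.1]]])  # transitions under action 1
R = np.array([[0.0, 0.5],
              [1.0, 0.2]])

Q_ucb, N_ucb = q_learning(P, R, policy="ucb")
Q_eps, N_eps = q_learning(P, R, policy="eps")
```

Running both policies against a reference solution (for example, value iteration on the same `P` and `R`) and plotting the error over time is one way to reproduce the kind of convergence comparison the findings describe.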
Abstract
We study the Whittle index learning algorithm for restless multi-armed bandits (RMAB). We first present the Q-learning algorithm and its variants: speedy Q-learning (SQL), generalized speedy Q-learning (GSQL), and phase Q-learning (PhaseQL). We also discuss two exploration policies: ε-greedy and upper confidence bound (UCB). We extend the study of Q-learning and its variants to the UCB policy. We illustrate with a numerical example that Q-learning with the UCB exploration policy converges faster, and that PhaseQL with UCB has the fastest convergence rate. We next extend the study of Q-learning variants to index learning for RMAB. The index learning algorithm is a two-timescale variant of stochastic approximation: on the slower timescale we update the index learning scheme, and on the faster timescale we update Q-learning assuming a fixed index value. We study constant stepsizes two timescale stochastic…
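The two-timescale structure described in the abstract can be sketched for a single arm as follows. This is an assumption-laden illustration, not the paper's algorithm: the subsidy-based update `lam += beta * (Q[x_ref, 1] - Q[x_ref, 0])` follows a form commonly used in the index-learning literature, and the names `learn_index`, `x_ref`, `alpha`, `beta`, and the toy arm (`P`, `R`) are hypothetical. Constant step sizes are used, with `alpha` much larger than `beta` so that Q-learning runs on the faster timescale.

```python
import numpy as np

def learn_index(P, R, x_ref, gamma=0.9, alpha=0.1, beta=0.001,
                steps=10000, eps=0.3, seed=0):
    """Two-timescale index learning for one arm (illustrative sketch).

    P[a, s, s'] are transition probabilities, R[s, a] are rewards;
    action 0 is passive, action 1 is active. Returns the learned
    index estimate for reference state x_ref and the Q table.
    """
    rng = np.random.default_rng(seed)
    n_states = R.shape[0]
    Q = np.zeros((n_states, 2))
    lam = 0.0  # index (subsidy) estimate for state x_ref
    s = 0
    for _ in range(steps):
        # epsilon-greedy exploration over the two actions.
        if rng.random() < eps:
            a = int(rng.integers(2))
        else:
            a = int(np.argmax(Q[s]))
        s_next = int(rng.choice(n_states, p=P[a, s]))
        # Faster timescale: Q-learning step with lam treated as fixed;
        # the passive action earns the subsidy lam on top of its reward.
        reward = R[s, a] + (lam if a == 0 else 0.0)
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        # Slower timescale: move lam toward the value at which active
        # and passive are equally attractive in the reference state.
        lam += beta * (Q[x_ref, 1] - Q[x_ref, 0])
        s = s_next
    return lam, Q

# Hypothetical 2-state arm used only to exercise the sketch.
P = np.array([[[0.7, 0.3], [0.2, 0.8]],   # passive transitions
              [[0.4, 0.6], [0.9, 0.1]]])  # active transitions
R = np.array([[0.0, 0.5],
              [0.2, 1.0]])

lam_hat, Q_hat = learn_index(P, R, x_ref=0)
```

In this sketch one index is learned per reference state, so a full index table would require repeating the call for each `x_ref`. Swapping the plain Q-learning step for SQL, GSQL, or PhaseQL, or replacing ε-greedy with UCB, would change only the faster-timescale part of the loop.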
Taxonomy
Topics: Advanced Bandit Algorithms Research · Cognitive Radio Networks and Spectrum Sensing · Smart Grid Energy Management
Methods: Q-Learning
