Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits

Guojun Xiong; Jian Li; Rahul Singh

arXiv:2109.09855·cs.LG·March 25, 2022·1 cites

TL;DR

This paper introduces an index policy for finite-horizon restless multi-armed bandits with multiple actions, proves that it is asymptotically optimal, and develops a learning algorithm for unknown system parameters that outperforms existing methods in regret and running time.

Contribution

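It proposes the Occupancy-Measured-Reward Index Policy for finite-horizon restless multi-armed, multi-action bandits (R(MA)^2B) and develops the R(MA)^2B-UCB learning algorithm for the case of unknown system parameters. The index policy is proven asymptotically optimal, and the learning algorithm achieves sub-linear regret.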

Findings

01. The policy is well-defined even if the underlying MDPs are not indexable.

02. The learning algorithm achieves sub-linear regret.

03. In experiments, the learning algorithm outperforms existing algorithms in both regret and running time.
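
For readers unfamiliar with the term, sub-linear regret can be stated roughly as follows. The notation here is an assumption for illustration and may differ from the paper's: $K$ is the number of learning episodes, $V^{\pi}$ is the expected cumulative reward of policy $\pi$ over one finite horizon, $\pi_k$ is the policy used in episode $k$, and $\pi^{*}$ is a comparator policy that knows the true system parameters.

```latex
\mathrm{Reg}(K) \;=\; \sum_{k=1}^{K} \bigl( V^{\pi^{*}} - V^{\pi_k} \bigr),
\qquad
\lim_{K \to \infty} \frac{\mathrm{Reg}(K)}{K} \;=\; 0 .
```

In words, the average per-episode loss relative to the comparator vanishes as the number of episodes grows.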

Abstract

We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B. The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken. The goal is to sequentially choose actions for arms so as to maximize the expected value of the cumulative rewards collected. Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy. Our policy is well-defined even if the underlying MDPs are not indexable. We prove that it is asymptotically optimal when the activation budget and number of arms are scaled up, while keeping their ratio as a constant. For the case when the system parameters are unknown, we develop a learning algorithm.…
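
To make the setting in the abstract concrete, here is a minimal sketch of one finite-horizon episode of a multi-action restless bandit under an activation budget. Everything in it is an assumption for illustration: the function names `simulate_episode` and `greedy_budget_policy`, the convention that action `a` consumes `a` units of budget (with `a = 0` as the passive action), the fixed initial state, and the greedy reward-per-cost scoring. The greedy rule is a simple stand-in baseline, not the paper's Occupancy-Measured-Reward Index Policy, whose construction is not given in this summary.

```python
import random


def simulate_episode(P, R, policy, horizon, budget, seed=0):
    """Run one finite-horizon episode of a multi-action restless bandit.

    P[n][s][a] is the list of next-state probabilities for arm n,
    R[n][s][a] is the reward for taking action a in state s on arm n.
    Assumption: action a consumes a units of budget (a = 0 is passive).
    """
    rng = random.Random(seed)
    n_arms = len(P)
    states = [0] * n_arms  # assumption: every arm starts in state 0
    total_reward = 0.0
    for t in range(horizon):
        actions = policy(states, t, R, budget)
        assert sum(actions) <= budget, "policy exceeded the activation budget"
        for n, (s, a) in enumerate(zip(states, actions)):
            total_reward += R[n][s][a]
            probs = P[n][s][a]
            # Restless: the state evolves even when the arm is passive (a = 0).
            states[n] = rng.choices(range(len(probs)), weights=probs)[0]
    return total_reward


def greedy_budget_policy(states, t, R, budget):
    """Hypothetical baseline: buy the best immediate reward gain per unit cost."""
    actions = [0] * len(states)
    remaining = budget
    # One candidate per (arm, non-passive action): gain over staying passive,
    # normalised by the budget that action consumes.
    candidates = []
    for n, s in enumerate(states):
        for a in range(1, len(R[n][s])):
            gain = R[n][s][a] - R[n][s][0]
            candidates.append((gain / a, n, a))
    for score, n, a in sorted(candidates, reverse=True):
        # Each arm receives at most one action; skip unaffordable or useless ones.
        if actions[n] == 0 and a <= remaining and score > 0:
            actions[n] = a
            remaining -= a
    return actions
```

Any function with the same signature as `greedy_budget_policy` can be passed as `policy`, so an index policy that ranks arms by a precomputed per-state, per-time index would plug into the same loop.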

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management