Restless Bandits with Individual Penalty Constraints: Near-Optimal Indices and Deep Reinforcement Learning
Nida Zamir, I-Hong Hou

TL;DR
This paper introduces a novel index policy for restless bandits with individual constraints, combining theoretical optimality and deep learning for practical resource management in wireless networks.
Contribution
It proposes the Penalty-Optimal Whittle index policy that is computationally efficient, asymptotically optimal, and adaptable via deep reinforcement learning.
Findings
The POW index policy is near-optimal in simulations.
The policy satisfies all individual penalty constraints.
Deep RL effectively learns the POW index online.
Abstract
This paper investigates the Restless Multi-Armed Bandit (RMAB) framework under individual penalty constraints to address resource allocation challenges in dynamic wireless networked environments. Unlike conventional RMAB models, our model allows each user (arm) to have distinct and stringent performance constraints, such as energy limits, activation limits, or age of information minimums, enabling the capture of diverse objectives including fairness and efficiency. To find the optimal resource allocation policy, we propose a new Penalty-Optimal Whittle (POW) index policy. The POW index of a user depends only on the user's transition kernel and penalty constraints, and remains invariant to system-wide features such as the number of users present and the amount of resources available. This makes it computationally tractable to calculate the POW indices offline without any need for online…
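The abstract as shown here does not spell out how the indices are turned into scheduling decisions. As a minimal sketch, assuming the usual index-policy rule (in each time slot, activate the users with the largest current index, up to the resource budget), a scheduling step could look like the following. The table `index_table`, the user ids, the states, and the function `select_users` are all hypothetical, and the index values are made up for illustration; they are not POW indices computed from the paper.

```python
import heapq


def select_users(indices, budget):
    """Return the ids of up to `budget` users with the largest current index.

    `indices` maps a user id to the index value of that user's current state.
    This is a generic index-policy rule, assumed here for illustration.
    """
    return sorted(heapq.nlargest(budget, indices, key=indices.get))


# Hypothetical offline table: index_table[user][state] -> index value.
# Each row depends only on that user, so adding or removing users
# does not change any other user's row.
index_table = {
    "u1": {0: 0.2, 1: 0.9},
    "u2": {0: 0.5, 1: 0.1},
    "u3": {0: 0.7, 1: 0.4},
}

# Hypothetical current state of each user in this time slot.
current_state = {"u1": 1, "u2": 0, "u3": 1}

# Look up each user's index for its current state, then pick the top two.
current = {user: index_table[user][state] for user, state in current_state.items()}
chosen = select_users(current, budget=2)
print(chosen)
```

Because each user's row is looked up independently, the online step reduces to a table lookup followed by a top-k selection, which matches the abstract's point that the indices can be computed offline.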
