Optimistic Whittle Index Policy: Online Learning for Restless Bandits
Kai Wang, Lily Xu, Aparna Taneja, Milind Tambe

TL;DR
This paper introduces UCWhittle, an online learning algorithm for restless bandits that uses an upper confidence bound approach to estimate transition dynamics and compute optimistic Whittle indices, achieving sublinear regret.
Contribution
It presents the first online learning algorithm for RMABs that combines the Whittle index policy with an upper confidence bound (UCB) approach to handle unknown transition dynamics.
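For readers unfamiliar with the Whittle index itself, the following is a minimal sketch of how it can be computed for a single arm when the transitions are known. It is not the paper's method (which solves a bilinear program over a confidence region). It is the common binary search over a passive-action subsidy, and it assumes the arm is indexable and that the index lies in the search interval. The names `q_values` and `whittle_index`, the discount `gamma`, and the example transition matrix are all hypothetical choices made for illustration.

```python
import numpy as np

def q_values(P, R, subsidy, gamma=0.9, iters=500):
    """Value iteration for one arm.

    P[s, a, s'] are transition probabilities (a=0 passive, a=1 active),
    R[s] is the state reward, and the passive action earns an extra subsidy.
    """
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R[:, None] + gamma * (P @ V)  # shape (num_states, 2)
        Q[:, 0] += subsidy                # subsidy is paid only when passive
        V = Q.max(axis=1)
    return Q

def whittle_index(P, R, state, gamma=0.9, lo=-10.0, hi=10.0, tol=1e-4):
    """Binary search for the subsidy at which both actions are equally good.

    Assumes indexability and that the index lies within [lo, hi].
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        Q = q_values(P, R, mid, gamma)
        if Q[state, 1] > Q[state, 0]:
            lo = mid  # acting is still preferred, so the subsidy is too small
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical two-state arm: state 1 is the rewarding state,
# and pulling the arm raises the chance of reaching it.
P = np.array([[[0.9, 0.1], [0.3, 0.7]],
              [[0.4, 0.6], [0.1, 0.9]]])
R = np.array([0.0, 1.0])
indices = [whittle_index(P, R, s) for s in range(2)]
print(indices)
```

A Whittle index policy would compute such an index for the current state of every arm and pull the arms with the largest indices, up to the budget.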
Findings
UCWhittle achieves sublinear frequentist regret of O(H√(T log T)) over T episodes with a constant horizon H.
It outperforms existing online learning baselines in three domains.
It is shown to be effective on a real-world maternal and childcare dataset.
Abstract
Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. Solving RMABs requires information on transition dynamics, which are often unknown upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we estimate confidence bounds of the transition probabilities and formulate a bilinear program to compute optimistic Whittle indices using these estimates. Our algorithm, UCWhittle, achieves sublinear O(H√(T log T)) frequentist regret to solve RMABs with unknown transitions in T episodes with a constant horizon H. Empirically, we demonstrate that UCWhittle…
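The confidence-bound step described in the abstract can be illustrated with a small sketch. It assumes transition counts are stored as `counts[s, a, s']` and uses a generic Hoeffding-style radius per entry; the paper's exact radius and constants are not reproduced here, and the function name `transition_confidence_bounds` and the parameter `delta` are hypothetical.

```python
import numpy as np

def transition_confidence_bounds(counts, delta=0.05):
    """Empirical transition estimates with per-entry confidence intervals.

    counts[s, a, s'] is the number of observed transitions s -> s' under
    action a. The radius sqrt(log(2/delta) / (2 n)) is a generic
    Hoeffding-style choice, not the paper's exact bound.
    """
    counts = np.asarray(counts, dtype=float)
    n_sa = counts.sum(axis=-1, keepdims=True)   # visits to each (s, a)
    n_safe = np.maximum(n_sa, 1.0)              # avoid division by zero
    uniform = 1.0 / counts.shape[-1]            # prior guess for unvisited pairs
    p_hat = np.where(n_sa > 0, counts / n_safe, uniform)
    radius = np.sqrt(np.log(2.0 / delta) / (2.0 * n_safe))
    lower = np.clip(p_hat - radius, 0.0, 1.0)
    upper = np.clip(p_hat + radius, 0.0, 1.0)
    return p_hat, lower, upper

# Hypothetical counts for a two-state, two-action arm:
# only (state 0, action 1) has been observed, 100 times.
counts = np.zeros((2, 2, 2))
counts[0, 1] = [20, 80]
p_hat, lower, upper = transition_confidence_bounds(counts)
print(p_hat[0, 1], lower[0, 1], upper[0, 1])
```

An optimistic planner would then search over transition probabilities inside these intervals for the values that make each arm's Whittle index as large as possible. The intervals shrink as more transitions are observed, which is what drives exploration toward rarely pulled arms.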
Taxonomy
Topics: Advanced Bandit Algorithms Research
