MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment
Mohsen Amiri, Konstantin Avrachenkov, Ibtihal El Mimouni, Sindri Magnússon

TL;DR
This paper introduces MARBLE, a restless bandit model whose arm dynamics are driven by a latent Markovian environment, proves that synchronous Q-learning with Whittle Indices (QWI) converges despite the resulting nonstationarity, and validates the approach on a recommender system simulator.
Contribution
It proposes MARBLE, an extension of RMABs with a latent Markov environment state, and establishes almost-sure convergence of synchronous Q-learning with Whittle Indices (QWI) under a relaxed indexability assumption, the Markov-Averaged Indexability (MAI) criterion.
Findings
QWI adapts effectively to shifting latent states.
Under the MAI criterion, QWI converges almost surely to the optimal Q-function and the corresponding Whittle indices, even though regime switches are unobserved.
MARBLE's approach is validated on a digital twin recommender system.
Abstract
Restless Multi-Armed Bandits (RMABs) are powerful models for decision-making under uncertainty, yet classical formulations typically assume fixed dynamics, an assumption often violated in nonstationary environments. We introduce MARBLE (Multi-Armed Restless Bandits in a Latent Markovian Environment), which augments RMABs with a latent Markov state that induces nonstationary behavior. In MARBLE, each arm evolves according to a latent environment state that switches over time, making policy learning substantially more challenging. We further introduce the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability assumption and prove that, despite unobserved regime switches, under the MAI criterion, synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and the corresponding Whittle indices. We validate MARBLE on a calibrated…
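The abstract does not spell out the QWI update rules, so the following is only a minimal sketch of a standard two-timescale scheme for Q-learning with Whittle indices on a single arm, adapted so that a hidden regime selects the transition kernel at each step. The function name `qwi_latent`, the step-size schedules, the discount factor, and the toy kernels `P`, rewards `R`, and regime chain `Z` are all illustrative assumptions, not the authors' implementation or experimental setup.

```python
import numpy as np

def qwi_latent(P, R, Z, gamma=0.9, steps=2000, seed=0):
    """Synchronous QWI sketch for one arm with a hidden regime (assumed setup).

    P: [n_env, 2, S, S] transition kernels per latent regime and action
    R: [2, S] rewards for action 0 (passive) and 1 (active)
    Z: [n_env, n_env] transition matrix of the latent regime chain
    """
    rng = np.random.default_rng(seed)
    n_env, n_act, S, _ = P.shape
    Q = np.zeros((S, S, n_act))  # Q[k, s, a]: Q-table for reference state k
    lam = np.zeros(S)            # lam[k]: Whittle index estimate for state k
    idx = np.arange(S)
    z = 0                        # latent regime, never shown to the learner
    for n in range(1, steps + 1):
        alpha = n ** -0.6                         # faster timescale (assumed)
        beta = 1.0 / (1.0 + n * np.log(n + 1.0))  # slower timescale (assumed)
        # Synchronous step: sample a next state for every (s, a) pair
        # from the kernel of the current hidden regime.
        nxt = np.array([[rng.choice(S, p=P[z, a, s]) for a in range(n_act)]
                        for s in range(S)])
        for k in range(S):
            v_next = Q[k].max(axis=1)
            target = R.T + gamma * v_next[nxt]
            target[:, 0] += lam[k]  # passivity subsidy on the passive action
            Q[k] += alpha * (target - Q[k])
        # Move each index toward indifference between actions at its own state.
        lam += beta * (Q[idx, idx, 1] - Q[idx, idx, 0])
        z = rng.choice(n_env, p=Z[z])  # the regime switches unobserved
    return Q, lam

# Toy instance: 2 regimes, 3 states, 2 actions (made-up numbers).
passive_a = [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.1, 0.3, 0.6]]
active_a = [[0.2, 0.3, 0.5], [0.1, 0.3, 0.6], [0.1, 0.1, 0.8]]
passive_b = [[0.9, 0.1, 0.0], [0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]
active_b = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
P = np.array([[passive_a, active_a], [passive_b, active_b]])
R = np.array([[0.0, 0.0, 0.0], [0.2, 0.5, 1.0]])
Z = np.array([[0.95, 0.05], [0.10, 0.90]])

Q, lam = qwi_latent(P, R, Z, steps=2000, seed=0)
print("Estimated Whittle indices per state:", lam)
```

The learner only ever sees sampled next states, never `z`, so the estimates track dynamics averaged over the regime chain; this is the intuition the sketch is meant to convey, not a reproduction of the paper's proof conditions.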
