Regret Bounds for Restless Markov Bandits
Ronald Ortner, Daniil Ryabko, Peter Auer, Rémi Munos

TL;DR
This paper introduces an algorithm for the restless Markov bandit problem that achieves near-optimal regret bounds without strong assumptions on the Markov chains, highlighting the suboptimality of index-based policies.
Contribution
The paper presents a new algorithm with $\tilde{O}(\sqrt{T})$ regret for restless Markov bandits and proves that index policies are suboptimal in this setting.
Findings
Achieves a $\tilde{O}(\sqrt{T})$ regret bound.
Index policies are suboptimal for restless Markov bandits.
No assumptions other than irreducibility on Markov chains.
Abstract
We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that after $T$ steps achieves $\tilde{O}(\sqrt{T})$ regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
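To make the setting concrete, here is a minimal simulation sketch of a restless bandit environment. It is not the algorithm from the paper: the names `RestlessArm`, `make_arms`, and `run`, the two-state chains, and the reward values are all hypothetical choices made for illustration. The point it shows is that every arm's chain advances at every step whether or not the arm is pulled, while the learner only observes the arm it chooses.

```python
import random


class RestlessArm:
    """One arm: a finite Markov chain that moves every step, pulled or not."""

    def __init__(self, transition, rewards, state=0):
        self.transition = transition  # transition[s][s2] = P(s -> s2)
        self.rewards = rewards        # rewards[s] = reward paid in state s
        self.state = state

    def step(self, rng):
        # Sample the next state from the current state's transition row.
        row = self.transition[self.state]
        self.state = rng.choices(range(len(row)), weights=row)[0]


def make_arms():
    # Illustrative assumption: two irreducible two-state chains that
    # deterministically alternate, started out of phase with each other.
    flip = [[0.0, 1.0], [1.0, 0.0]]
    return [
        RestlessArm(flip, rewards=[0.0, 1.0], state=0),
        RestlessArm(flip, rewards=[0.0, 1.0], state=1),
    ]


def run(arms, policy, horizon, seed=0):
    """Play `policy` for `horizon` steps and return the total reward."""
    rng = random.Random(seed)
    total = 0.0
    last_seen = [None] * len(arms)  # (state, time) of the latest observation
    for t in range(horizon):
        i = policy(t, last_seen)
        total += arms[i].rewards[arms[i].state]
        last_seen[i] = (arms[i].state, t)
        # Restless dynamics: all arms evolve, not only the pulled one.
        for arm in arms:
            arm.step(rng)
    return total


if __name__ == "__main__":
    always_first = lambda t, seen: 0
    alternate = lambda t, seen: 1 - t % 2
    print(run(make_arms(), always_first, horizon=10))
    print(run(make_arms(), alternate, horizon=10))
```

In this toy instance, a policy that switches arms in step with the chains collects more reward than any policy that commits to a single arm, which is why regret in this setting is measured against the best policy rather than the best fixed arm.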
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
