Regret Bounds for Restless Markov Bandits

Ronald Ortner; Daniil Ryabko; Peter Auer; Rémi Munos

arXiv:1209.2693·cs.LG·October 23, 2012·32 cites

TL;DR

This paper introduces an algorithm for the restless Markov bandit problem that achieves a near-optimal regret bound while assuming only that the Markov chains are irreducible, and it shows that index-based policies are necessarily suboptimal in this setting.

Contribution

The paper presents a new algorithm with $\tilde{O}(\sqrt{T})$ regret for restless Markov bandits, and proves index policies are suboptimal in this setting.

Findings

01

Achieves $\tilde{O}(\sqrt{T})$ regret bound.

02

Index policies are suboptimal for restless Markov bandits.

03

No assumptions other than irreducibility on Markov chains.

Abstract

We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that after $T$ steps achieves $\tilde{O}(\sqrt{T})$ regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
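The "restless" property described in the abstract can be illustrated with a small simulation. The following is a minimal sketch and not the paper's algorithm: the names `RestlessArm` and `simulate` are hypothetical, rewards are assumed to be deterministic per state, and the policy is assumed to depend only on the time step. It shows only that every arm's chain advances at every step, whether or not that arm is pulled.

```python
import random


class RestlessArm:
    """One arm: a finite Markov chain that advances every step, pulled or not.

    Hypothetical helper for illustration; not taken from the paper.
    """

    def __init__(self, transition, rewards, state=0):
        self.transition = transition  # transition[s][s2] = P(next = s2 | current = s)
        self.rewards = rewards        # rewards[s] = reward in state s (assumed deterministic)
        self.state = state

    def step(self, rng):
        # Sample the next state from the row of the current state.
        probs = self.transition[self.state]
        self.state = rng.choices(range(len(probs)), weights=probs)[0]


def simulate(arms, policy, horizon, seed=0):
    """Run `policy` (a function from time step to arm index) and return total reward."""
    rng = random.Random(seed)
    total = 0.0
    for t in range(horizon):
        chosen = policy(t)
        # Only the chosen arm yields a reward ...
        total += arms[chosen].rewards[arms[chosen].state]
        # ... but all arms evolve, independently of the learner's action.
        for arm in arms:
            arm.step(rng)
    return total
```

With two arms that deterministically flip between a reward-0 state and a reward-1 state and start out of phase, a policy that alternates arms can collect more reward than any policy that commits to a single arm, which is why regret here is measured against the best policy rather than the best fixed arm.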

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms