Posterior sampling for reinforcement learning: worst-case regret bounds
Shipra Agrawal, Randy Jia

TL;DR
This paper introduces a posterior sampling algorithm for reinforcement learning that achieves near-optimal worst-case regret bounds in finite, communicating Markov Decision Processes, with a high-probability upper bound that closely matches the known lower bound.
Contribution
The paper presents a new posterior sampling algorithm with proven near-optimal worst-case regret bounds for communicating MDPs, including novel anti-concentration results for Dirichlet distributions.
Findings
High-probability regret upper bound of Õ(DS√(AT)) for communicating MDPs with S states, A actions, diameter D, and time horizon T
Closely matches the known lower bound of Ω(√(DSAT))
Novel anti-concentration results for Dirichlet distributions
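To make "closely matches" concrete, the two rates above can be compared directly. The sketch below is only illustrative: it drops all constants and logarithmic factors hidden by the Õ and Ω notation, and the function names are hypothetical, not taken from the paper. Under those simplifications the ratio of the upper bound to the lower bound is √(DS), independent of A and T.

```python
import math

def upper_bound_rate(D, S, A, T):
    """Leading term D*S*sqrt(A*T) of the upper bound (constants and log factors dropped)."""
    return D * S * math.sqrt(A * T)

def lower_bound_rate(D, S, A, T):
    """Leading term sqrt(D*S*A*T) of the lower bound (constants dropped)."""
    return math.sqrt(D * S * A * T)

def gap(D, S, A, T):
    """Ratio of the two leading terms; algebraically equal to sqrt(D*S)."""
    return upper_bound_rate(D, S, A, T) / lower_bound_rate(D, S, A, T)

# Illustrative values only (not from the paper): D=4, S=9, A=2, T=10_000.
ratio = gap(4, 9, 2, 10_000)
print(ratio, math.sqrt(4 * 9))
```

Both bounds grow as √T, so the gap does not widen with the time horizon; it depends only on the diameter and the number of states.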
Abstract
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of Õ(DS√(AT)) for any communicating MDP with S states, A actions and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon T. This result closely matches the known lower bound of Ω(√(DSAT)). Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
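The abstract does not spell out the algorithm, so the following is only a minimal sketch of the generic posterior sampling step that such methods build on, not the authors' procedure. It assumes a symmetric Dirichlet prior over next-state distributions and a hypothetical `counts[s][a][s_next]` table of observed transitions; both are assumptions made for illustration. A Dirichlet sample is drawn by normalizing independent Gamma variates, which needs only the standard library.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) vector by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_transition_model(counts, prior=1.0, rng=None):
    """Sample a transition model from the Dirichlet posterior.

    counts[s][a][s_next] holds observed transition counts (hypothetical layout).
    prior is the symmetric Dirichlet prior parameter (an assumption).
    Returns P with P[s][a] a probability vector over next states.
    """
    rng = rng or random.Random()
    return [
        [sample_dirichlet([c + prior for c in next_counts], rng)
         for next_counts in per_action]
        for per_action in counts
    ]

# Toy example: 2 states, 2 actions, made-up counts.
counts = [[[3, 0], [1, 1]],
          [[0, 5], [2, 2]]]
P = sample_transition_model(counts, rng=random.Random(0))
for s, per_action in enumerate(P):
    for a, probs in enumerate(per_action):
        print(s, a, [round(p, 3) for p in probs])
```

In a posterior sampling scheme, the agent would then plan in the sampled MDP and act on the resulting policy before updating the counts. How often to resample and how many samples to draw are design choices that this sketch leaves open.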
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
