(More) Efficient Reinforcement Learning via Posterior Sampling
Ian Osband, Daniel Russo, Benjamin Van Roy

TL;DR
This paper studies posterior sampling for reinforcement learning (PSRL), an alternative to optimism-based exploration. It supports the approach with a bound on expected regret and with simulations in which PSRL outperforms existing algorithms.
Contribution
It introduces PSRL, a simple, computationally efficient algorithm with near-optimal regret bounds, and shows its advantages over existing methods.
Findings
PSRL achieves an $\tilde{O}(\tau S \sqrt{AT})$ bound on expected regret.
PSRL outperforms existing algorithms in simulations.
The approach naturally encodes prior knowledge.
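To see what a bound of this form implies, it can help to divide it by the elapsed time $T$. The short derivation below is only an illustration: it treats the episode length $\tau$ and the cardinalities $S$ and $A$ as fixed, and the $\tilde{O}$ notation hides logarithmic factors.

$$
\frac{1}{T}\,\tilde{O}\!\left(\tau S \sqrt{A T}\right) \;=\; \tilde{O}\!\left(\tau S \sqrt{\frac{A}{T}}\right) \;\longrightarrow\; 0 \quad \text{as } T \to \infty .
$$

Under these assumptions the expected regret grows sublinearly in $T$, so the average per-step regret vanishes up to logarithmic factors.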
Abstract
Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the episode length and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm…
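The episode loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a tabular MDP, an independent Dirichlet prior over each transition distribution, Bernoulli rewards with Beta priors, and a fixed start state 0. The function names `solve_finite_horizon` and `psrl` are hypothetical.

```python
import numpy as np


def solve_finite_horizon(P, R, tau):
    """Backward induction for a horizon-tau MDP.

    P has shape (S, A, S), R has shape (S, A).
    Returns a (tau, S) array of greedy actions per timestep and state.
    """
    S, A = R.shape
    V = np.zeros(S)
    policy = np.zeros((tau, S), dtype=int)
    for h in reversed(range(tau)):
        Q = R + P @ V              # (S, A): reward plus expected next value
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy


def psrl(true_P, true_R, tau, n_episodes, seed=0):
    """Run PSRL against a simulated MDP (true_R holds Bernoulli means)."""
    rng = np.random.default_rng(seed)
    S, A = true_R.shape
    trans_counts = np.ones((S, A, S))   # assumed Dirichlet(1, ..., 1) prior
    rew_ab = np.ones((S, A, 2))         # assumed Beta(1, 1) prior on rewards
    total_reward = 0.0

    for _ in range(n_episodes):
        # 1. Draw one MDP from the current posterior.
        P = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                      for s in range(S)])
        R = rng.beta(rew_ab[..., 0], rew_ab[..., 1])

        # 2. Compute the policy that is optimal for the sampled MDP.
        policy = solve_finite_horizon(P, R, tau)

        # 3. Follow that policy for one episode, updating the posterior.
        s = 0
        for h in range(tau):
            a = policy[h, s]
            r = float(rng.random() < true_R[s, a])
            s_next = rng.choice(S, p=true_P[s, a])
            trans_counts[s, a, s_next] += 1
            rew_ab[s, a, 0] += r
            rew_ab[s, a, 1] += 1.0 - r
            total_reward += r
            s = s_next

    return total_reward, trans_counts, rew_ab
```

Because the Dirichlet and Beta priors are conjugate to the assumed observation models, the posterior update reduces to incrementing counts. Prior knowledge could be encoded in this sketch by changing the initial values of `trans_counts` and `rew_ab`.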
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
