Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes
Cyrille Kone, Kevin Jamieson

TL;DR
This paper introduces a computationally efficient, asymptotically optimal algorithm for policy identification in tabular MDPs, improving upon prior methods in sample complexity and dependence on confidence parameters.
Contribution
Proposes a novel randomized posterior sampling algorithm that achieves asymptotic optimality and practical efficiency for policy identification in finite-horizon MDPs.
Findings
Achieves asymptotic optimality in sample complexity.
Runs in $O(S^2AH)$ time per episode, matching standard model-based approaches.
Guarantees remain meaningful in the asymptotic regime, avoiding sub-optimal dependence on $\log(1/\delta)$.
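To make the $O(S^2AH)$ figure concrete, here is a minimal sketch of finite-horizon backward induction, the standard model-based planning step whose cost has this order. It is not the paper's algorithm; the function name `backward_induction`, the nested-list layout of `P` and `R`, and the `ops` counter are assumptions made for illustration only.

```python
def backward_induction(P, R, H):
    """Plan in a known tabular MDP by dynamic programming.

    P[s][a][s2]: transition probability (hypothetical nested-list layout).
    R[s][a]:     expected reward.
    H:           horizon (number of steps per episode).
    Returns the step-0 values, a per-step greedy policy, and the number of
    multiply-add operations performed.
    """
    S, A = len(P), len(P[0])
    V = [0.0] * S          # value after the final step is zero
    policy = []
    ops = 0                # counts inner-loop operations: S * S * A * H in total
    for _ in range(H):     # one sweep per step, from the last step backwards
        Q = [[0.0] * A for _ in range(S)]
        for s in range(S):
            for a in range(A):
                q = R[s][a]
                for s2 in range(S):
                    q += P[s][a][s2] * V[s2]
                    ops += 1
                Q[s][a] = q
        # Greedy action per state; ties go to the lowest action index.
        pi_h = [max(range(A), key=lambda a: Q[s][a]) for s in range(S)]
        V = [Q[s][pi_h[s]] for s in range(S)]
        policy.append(pi_h)
    policy.reverse()       # policy[h][s] is the action at step h in state s
    return V, policy, ops
```

The three nested loops over `s`, `a`, and `s2`, repeated for each of the `H` steps, are where the $S^2AH$ count comes from.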
Abstract
We study the $(\varepsilon,\delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon > 0$) but suffer from high computational cost, rendering them hard to implement, and also suffer from suboptimal dependence on $\log(1/\delta)$. We propose a randomized and computationally efficient algorithm for best policy identification that combines posterior sampling with an online learning algorithm to guide exploration in the MDP. Our method achieves asymptotic optimality in sample complexity, also in terms of posterior contraction rate, and runs in $O(S^2AH)$ per episode, matching standard model-based approaches. Unlike prior algorithms such as MOCA and PEDEL, our guarantees remain meaningful in the asymptotic regime and avoid sub-optimal polynomial dependence on…
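The abstract does not spell out how the posterior is maintained or sampled. As a generic illustration of posterior sampling in a tabular MDP, the sketch below draws one transition model from independent Dirichlet posteriors over each state-action pair. The Dirichlet prior, the function name `sample_transition_model`, and the `counts` layout are all assumptions, and the online learning component that guides exploration in the paper is not shown.

```python
import random


def sample_transition_model(counts, prior=1.0, rng=random):
    """Draw one transition model from a Dirichlet posterior (assumed prior).

    counts[s][a][s2]: number of observed transitions s --a--> s2
                      (hypothetical nested-list layout).
    prior:            symmetric Dirichlet pseudo-count, an assumption here.
    rng:              the random module or a random.Random instance.
    Returns P with the same shape, where each P[s][a] sums to 1.
    """
    P = []
    for s_counts in counts:
        row = []
        for sa_counts in s_counts:
            # A Dirichlet draw is a vector of independent Gamma draws,
            # normalised to sum to one.
            g = [rng.gammavariate(prior + n, 1.0) for n in sa_counts]
            total = sum(g)
            row.append([x / total for x in g])
        P.append(row)
    return P
```

With many observations for a state-action pair, the sampled row concentrates near the empirical frequencies; with few, it stays spread out, which is what drives exploration in posterior-sampling methods.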
