Online Learning for Stochastic Shortest Path Model via Posterior Sampling
Mehdi Jafarnia-Jahromi, Liyu Chen, Rahul Jain, Haipeng Luo

TL;DR
This paper introduces PSRL-SSP, a posterior sampling algorithm for online reinforcement learning in stochastic shortest path (SSP) problems. It comes with a Bayesian regret bound and outperforms existing optimism-based methods in numerical experiments.
Contribution
The paper presents the first posterior sampling-based algorithm for SSP, with a proven Bayesian regret bound and no need for hyper-parameter tuning.
Findings
Achieves a Bayesian regret bound of Õ(B_* S√(AK)).
Outperforms previous optimism-based algorithms in numerical experiments.
Requires only knowledge of the prior distribution; there are no hyper-parameters to tune.
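As a rough illustration of how the bound scales, the sketch below evaluates only the leading term B_* S√(AK), omitting the constants and logarithmic factors that the bound hides. The function name `regret_bound_scale` and the example numbers are hypothetical and are not taken from the paper.

```python
import math


def regret_bound_scale(b_star: float, num_states: int, num_actions: int, num_episodes: int) -> float:
    """Leading term B_* * S * sqrt(A * K); constants and log factors are omitted."""
    return b_star * num_states * math.sqrt(num_actions * num_episodes)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only to show the square-root growth in K.
    small = regret_bound_scale(b_star=5.0, num_states=10, num_actions=2, num_episodes=1000)
    large = regret_bound_scale(b_star=5.0, num_states=10, num_actions=2, num_episodes=4000)
    print(large / small)
```

Because the term grows with the square root of the number of episodes K, quadrupling K only doubles it, so the average regret per episode shrinks as K grows.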
Abstract
We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state. We propose PSRL-SSP, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem. The algorithm operates in epochs. At the beginning of each epoch, a sample is drawn from the posterior distribution on the unknown model dynamics, and the optimal policy with respect to the drawn sample is followed during that epoch. An epoch completes if either the number of visits to the goal state in the current epoch exceeds that of the previous epoch, or the number of visits to any of the state-action pairs is doubled. We establish a Bayesian regret bound of Õ(B_* S√(AK)), where B_* is an upper bound on the expected cost of the optimal policy, S is the size of the state space, A is the size of the…
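The epoch structure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a tabular SSP whose goal is the last state index, a known cost table, an independent Dirichlet(1, ..., 1) prior on each transition row, and plain value iteration as the planner. The names `plan` and `psrl_ssp` are hypothetical.

```python
import numpy as np


def plan(P, cost, max_iters=2000, tol=1e-8):
    """Value iteration on a sampled SSP model (assumed planner).

    P has shape (S, A, S + 1); the last next-state index is the goal, whose value is 0.
    Returns a greedy deterministic policy of shape (S,).
    """
    S = P.shape[0]
    V = np.zeros(S + 1)
    for _ in range(max_iters):
        V_new = (cost + P @ V).min(axis=1)
        converged = np.max(np.abs(V_new - V[:S])) < tol
        V[:S] = V_new
        if converged:
            break
    return (cost + P @ V).argmin(axis=1)


def psrl_ssp(P_true, cost, num_episodes, s_init=0, seed=0):
    """Run the epoch-based posterior sampling loop; returns (total cost, number of epochs)."""
    rng = np.random.default_rng(seed)
    S, A, _ = P_true.shape
    goal = S
    alpha = np.ones((S, A, S + 1))    # Dirichlet prior parameters (assumption)
    counts = np.zeros((S, A, S + 1))  # observed transition counts
    prev_goal_visits = 0
    episodes_done, total_cost, epochs = 0, 0.0, 0
    s = s_init
    while episodes_done < num_episodes:
        epochs += 1
        # Draw one model from the posterior: normalized Gamma draws are Dirichlet samples.
        g = rng.gamma(alpha + counts)
        P_sample = g / g.sum(axis=2, keepdims=True)
        policy = plan(P_sample, cost)
        n_start = counts.sum(axis=2)  # visit counts at the start of the epoch
        goal_visits = 0
        while episodes_done < num_episodes:
            a = policy[s]
            s_next = rng.choice(S + 1, p=P_true[s, a])
            counts[s, a, s_next] += 1
            total_cost += cost[s, a]
            # Stopping rule 2: visits to some state-action pair have doubled.
            doubled = counts[s, a].sum() > 2 * n_start[s, a]
            if s_next == goal:
                goal_visits += 1
                episodes_done += 1
                s = s_init  # the next episode restarts from the initial state
            else:
                s = s_next
            # Stopping rule 1: more goal visits than in the previous epoch.
            if doubled or goal_visits > prev_goal_visits:
                break
        prev_goal_visits = goal_visits
    return total_cost, epochs


if __name__ == "__main__":
    env_rng = np.random.default_rng(1)
    S, A = 3, 2
    # Hypothetical random environment; every action reaches the goal with positive probability.
    P_true = env_rng.dirichlet(np.ones(S + 1), size=(S, A))
    cost = env_rng.uniform(0.1, 1.0, size=(S, A))
    print(psrl_ssp(P_true, cost, num_episodes=50))
```

The sketch has no tuning parameters beyond the prior itself (the planner's iteration cap and tolerance only control numerical accuracy), which mirrors the claim that the algorithm needs only the prior distribution.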
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management
