Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret
Jean Tarbouriech, Runlong Zhou, Simon S. Du, Matteo Pirotta, Michal Valko, Alessandro Lazaric

TL;DR
This paper introduces a parameter-free algorithm for stochastic shortest path problems that achieves minimax regret rates and is nearly horizon-free, advancing the understanding of learning efficiency in goal-oriented stochastic environments.
Contribution
The paper proposes EB-SSP, a novel model-based algorithm whose optimistic value iteration scheme is guaranteed to converge and which achieves the minimax regret rate without prior knowledge of key problem parameters. It also extends horizon-free regret bounds beyond finite-horizon MDPs.
Findings
Achieves the minimax regret rate Õ(B_⋆ √(SAK)).
The algorithm is parameter-free: it does not require prior knowledge of B_⋆ or T_⋆.
Provides nearly horizon-free regret bounds in stochastic shortest path settings.
Abstract
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to induce an optimistic SSP problem whose associated value iteration scheme is guaranteed to converge. We prove that EB-SSP achieves the minimax regret rate Õ(B_⋆ √(SAK)), where K is the number of episodes, S is the number of states, A is the number of actions, and B_⋆ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of B_⋆, nor of…
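The construction described in the abstract (skewed empirical transitions, bonus-perturbed costs, then value iteration) can be illustrated with a minimal Python sketch. This is a simplified illustration under stated assumptions, not the authors' implementation. The skewing rule used here mixes each empirical transition with a 1/(n+1) probability of jumping to the goal, and the bonus is a generic bonus_scale/sqrt(n) placeholder rather than the bonus analyzed in the paper. The function names, the data layout, and the bonus_scale parameter are all hypothetical.

```python
import math

def build_optimistic_ssp(counts, cost_hat, bonus_scale=1.0):
    """Build skewed transitions and bonus-reduced costs (illustrative only).

    counts[s][a][s2]: observed transitions from non-goal state s under action a
                      to s2, where index S (the last one) is the goal.
    cost_hat[s][a]:   empirical mean cost, assumed to lie in [0, 1].
    """
    S = len(counts)
    goal = S
    p_tilde, c_tilde = [], []
    for s in range(S):
        p_row, c_row = [], []
        for a in range(len(counts[s])):
            n = sum(counts[s][a])
            # Skewing (assumed form): n/(n+1) * empirical + 1/(n+1) * goal.
            p = [c / (n + 1) for c in counts[s][a]]
            p[goal] += 1.0 / (n + 1)
            # Placeholder count-based bonus; NOT the paper's bonus.
            bonus = bonus_scale / math.sqrt(max(n, 1))
            p_row.append(p)
            c_row.append(max(0.0, cost_hat[s][a] - bonus))
        p_tilde.append(p_row)
        c_tilde.append(c_row)
    return p_tilde, c_tilde

def value_iteration(p_tilde, c_tilde, tol=1e-8, max_iters=100_000):
    """Value iteration on the optimistic SSP; the goal has value 0."""
    S = len(p_tilde)
    V = [0.0] * (S + 1)
    Q = []
    for _ in range(max_iters):
        Q = [[c_tilde[s][a] + sum(p * v for p, v in zip(p_tilde[s][a], V))
              for a in range(len(c_tilde[s]))] for s in range(S)]
        V_new = [min(q) for q in Q] + [0.0]
        diff = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if diff <= tol:
            break
    # Greedy policy with respect to the last computed Q-values.
    policy = [min(range(len(q)), key=q.__getitem__) for q in Q]
    return V, policy
```

In this sketch every state-action pair reaches the goal with probability at least 1/(n+1) under the skewed model, so the value iteration update is a contraction and the loop terminates. That is the intuition behind the convergence guarantee the abstract mentions, shown here only for this simplified variant.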
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Optimization and Search Problems
