Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

Jean Tarbouriech, Runlong Zhou, Simon S. Du, Matteo Pirotta, Michal Valko, Alessandro Lazaric

arXiv:2104.11186·cs.LG·December 13, 2021·5 cites

TL;DR

This paper introduces a parameter-free algorithm for stochastic shortest path (SSP) problems that achieves the minimax regret rate and, in several cases, a nearly horizon-free regret bound, advancing the understanding of learning efficiency in goal-oriented stochastic environments.

Contribution

The paper proposes EB-SSP, a novel model-based algorithm whose optimistic value iteration scheme is guaranteed to converge and which achieves minimax regret without prior knowledge of key parameters (B⋆ and T⋆). It also extends (nearly) horizon-free regret bounds beyond finite-horizon MDPs.

Findings

01

Achieves the minimax regret rate Õ(B⋆ √(SAK))

02

Parameter-free: the algorithm does not require prior knowledge of B⋆ or T⋆

03

Provides (nearly) horizon-free regret bounds, i.e., bounds with only a logarithmic dependence on T⋆, in several stochastic shortest path settings.
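To make the first finding concrete, the sketch below computes the regret of a run and the leading term of the bound. It assumes the standard SSP regret with a fixed initial state s0, R_K = (total cost over K episodes) − K·V⋆(s0), and drops the constants and logarithmic factors hidden in Õ. The function names `ssp_regret` and `minimax_rate` are illustrative only.

```python
import math


def ssp_regret(episode_costs, v_star_s0):
    """Regret after K episodes: total incurred cost minus K * V*(s0).

    Assumes every episode starts from the same initial state s0 and that
    episode_costs[k] is the cumulative cost paid until reaching the goal.
    """
    return sum(episode_costs) - len(episode_costs) * v_star_s0


def minimax_rate(b_star, num_states, num_actions, num_episodes):
    """Leading term B* sqrt(S A K) of the bound, without constants or log factors."""
    return b_star * math.sqrt(num_states * num_actions * num_episodes)


if __name__ == "__main__":
    print(ssp_regret([3.0, 2.5, 2.0], v_star_s0=2.0))  # 1.5
    print(minimax_rate(2.0, 10, 4, 1000))  # 400.0
    print(minimax_rate(2.0, 10, 4, 4000))  # 800.0
```

Quadrupling K only doubles the bound, which is the √K scaling; matching the Ω(B⋆√(SAK)) lower bound up to log factors is what makes the rate minimax.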

Abstract

We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to induce an optimistic SSP problem whose associated value iteration scheme is guaranteed to converge. We prove that EB-SSP achieves the minimax regret rate $\tilde{O}(B_{\star} \sqrt{SAK})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of $B_{\star}$, nor of…
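The mechanism described in the abstract (skew the empirical transitions, subtract an exploration bonus from the empirical costs, then run value iteration) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation. It assumes the goal is the last next-state index, assumes a skew of the form P̃ = n/(n+1)·P̂ + 1/(n+1)·1[s' = goal], and replaces the paper's bonus with a hypothetical Hoeffding-style `hoeffding_bonus` whose scale `bonus_scale` is a fixed input. Because of that fixed scale, this sketch is not parameter-free, unlike EB-SSP.

```python
import numpy as np


def skew_transitions(counts):
    """Skew empirical transitions toward the goal (assumed form, see lead-in).

    counts: array of shape (S, A, S + 1) with observed next-state counts;
    the last next-state index is assumed to be the goal.
    """
    n = counts.sum(axis=2, keepdims=True)  # visit counts n(s, a), shape (S, A, 1)
    p_hat = np.divide(counts, n, out=np.zeros(counts.shape), where=n > 0)
    p_tilde = n / (n + 1) * p_hat  # shrink the empirical estimate
    p_tilde[:, :, -1] += 1.0 / (n[:, :, 0] + 1)  # move the remaining mass to the goal
    return p_tilde


def hoeffding_bonus(counts, bonus_scale, delta=0.05):
    """Hypothetical Hoeffding-style bonus; NOT the bonus used by EB-SSP."""
    n = counts.sum(axis=2)  # shape (S, A)
    return bonus_scale * np.sqrt(np.log(1.0 / delta) / np.maximum(n, 1))


def optimistic_vi(c_hat, p_tilde, bonus, tol=1e-8, max_iter=100_000):
    """Value iteration on the optimistic SSP: costs c_hat - bonus, transitions p_tilde."""
    num_states = c_hat.shape[0]
    v = np.zeros(num_states)  # the goal's value is fixed at 0 and not stored
    for _ in range(max_iter):
        # Q(s, a) = max(c_hat - bonus + sum over non-goal s' of P~(s'|s, a) V(s'), 0)
        q = np.maximum(c_hat - bonus + p_tilde[:, :, :num_states] @ v, 0.0)
        v_new = q.min(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q
        v = v_new
    raise RuntimeError("value iteration did not converge")


if __name__ == "__main__":
    counts = np.array([[[1, 1]]])  # S=1, A=1: one self-loop, one transition to the goal
    p_tilde = skew_transitions(counts)  # [[[1/3, 2/3]]]
    v, q = optimistic_vi(np.array([[1.0]]), p_tilde, bonus=np.zeros((1, 1)))
    print(v)  # approximately [1.5]
```

Every skewed row places probability at least 1/(n+1) on the goal, so with a fixed (value-independent) bonus the optimistic Bellman operator is a sup-norm contraction with modulus at most max n/(n+1) < 1, which is why the loop terminates. The bonus in EB-SSP is more elaborate, and the paper proves convergence of its own scheme.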

Datasets

misovalko/my-research-papers
dataset · 21 downloads

Videos

Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret· slideslive

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Optimization and Search Problems