Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs

Max Simchowitz; Kevin Jamieson

arXiv:1905.03814 · cs.LG · October 30, 2019 · 32 cites

Open Access

TL;DR

This paper proves that certain optimistic algorithms for episodic MDPs achieve non-asymptotic, gap-dependent logarithmic regret bounds without relying on diameter or ergodicity assumptions, bridging gap-dependent and minimax rates.

Contribution

It introduces a novel 'clipped' regret decomposition technique that provides gap-dependent regret bounds for a broad class of optimistic algorithms in episodic MDPs, independent of diameter-like quantities.
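As a rough illustration of what "clipping" means here, the sketch below assumes the operator has the form clip[x | ε] = x · 1{x ≥ ε}: a term is kept when it reaches the threshold ε and set to zero otherwise. The function name `clip`, the surplus values, and the threshold are all hypothetical. In the analysis the threshold would presumably be tied to the suboptimality gaps, which is an assumption about the general shape of the argument rather than a statement of the paper's exact constants.

```python
def clip(x: float, eps: float) -> float:
    """Return x when x >= eps, otherwise 0.0 (assumed form of the clipping operator)."""
    return x if x >= eps else 0.0


# Hypothetical per-step optimism surpluses and a hypothetical threshold.
surpluses = [0.5, 0.0625, 0.03125, 0.25]
threshold = 0.125

# Only surpluses at or above the threshold contribute to the clipped sum.
clipped = [clip(s, threshold) for s in surpluses]
clipped_total = sum(clipped)
```

Under these made-up numbers, the two small surpluses are dropped, so the clipped sum counts only the steps where the optimistic estimate was off by at least the threshold.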

Findings

01

Achieves logarithmic regret bounds that depend on the gap and are non-asymptotic.

02

Bounds do not depend on diameter-like quantities or ergodicity assumptions.

03

Interpolates smoothly between gap-dependent logarithmic regret and the minimax $\tilde{O}(\sqrt{HSAT})$ rate.

Abstract

This paper establishes that optimistic algorithms attain gap-dependent and non-asymptotic logarithmic regret for episodic MDPs. In contrast to prior work, our bounds do not suffer a dependence on diameter-like quantities or ergodicity, and smoothly interpolate between the gap-dependent logarithmic regret and the $\tilde{O}(\sqrt{HSAT})$ minimax rate. The key technique in our analysis is a novel "clipped" regret decomposition which applies to a broad family of recent optimistic algorithms for episodic MDPs.
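To make "gap" concrete: in an episodic MDP with horizon H, the suboptimality gap of action a in state x at stage h is usually defined as gap_h(x, a) = V*_h(x) − Q*_h(x, a). The sketch below computes these gaps by backward induction on a toy two-state, two-action MDP. The function name `suboptimality_gaps`, the transition table `P`, and the reward table `R` are hypothetical, and the sketch assumes a known model with stage-independent transitions and rewards, which the learner in the regret setting does not have.

```python
def suboptimality_gaps(P, R, H):
    """Backward induction on a known tabular MDP.

    P[s][a][t] is the probability of moving from state s to state t under
    action a, R[s][a] is the expected reward, and H is the horizon.
    Returns gaps[h][s][a] = V*_h(s) - Q*_h(s, a), with h = 0 the first stage.
    """
    S, A = len(R), len(R[0])
    V_next = [0.0] * S  # value after the final stage is zero
    gaps = [None] * H
    for h in reversed(range(H)):
        Q = [
            [R[s][a] + sum(P[s][a][t] * V_next[t] for t in range(S)) for a in range(A)]
            for s in range(S)
        ]
        V = [max(Q[s]) for s in range(S)]
        gaps[h] = [[V[s] - Q[s][a] for a in range(A)] for s in range(S)]
        V_next = V
    return gaps


# Toy MDP (made-up numbers): 2 states, 2 actions, horizon 2.
R = [[1.0, 0.0], [0.5, 0.25]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 1.0]],
]
gaps = suboptimality_gaps(P, R, H=2)

# Smallest strictly positive gap, the quantity that typically drives log-regret bounds.
gap_min = min(g for stage in gaps for row in stage for g in row if g > 0)
```

In this toy instance every state has a unique optimal action at each stage (gap zero), and the smallest positive gap is 0.25. Gap-dependent bounds of the kind the abstract describes scale roughly with log T divided by such gaps, while the minimax rate applies when gaps are too small to exploit.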

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management