Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs

Shulun Chen; Runlong Zhou; Zihan Zhang; Maryam Fazel; Simon S. Du

arXiv:2506.06521 · cs.LG · June 10, 2025

TL;DR

This paper proves a variance-aware, gap-dependent regret bound for the MVP algorithm in episodic MDPs, highlighting the role of the maximum conditional total variance in learning efficiency and establishing a matching lower bound.

Contribution

It provides the first variance-aware gap-dependent regret bounds for episodic MDPs and introduces a novel analysis technique based on weighted suboptimality gaps.

Findings

01  The MVP algorithm achieves a gap-dependent regret bound that scales with the maximum conditional total variance.

02  A lower bound shows the necessity of variance dependence even with zero unconditional variance.

03  The analysis can be adapted to other algorithms for improved regret guarantees.

Abstract

We consider the gap-dependent regret bounds for episodic MDPs. We show that the Monotonic Value Propagation (MVP) algorithm achieves a variance-aware gap-dependent regret bound of $\tilde{O}\left(\left(\sum_{\Delta_h(s,a)>0} \frac{H^2 \log K \land \mathrm{Var}_{\max}^{c}}{\Delta_h(s,a)} + \sum_{\Delta_h(s,a)=0} \frac{H^2 \land \mathrm{Var}_{\max}^{c}}{\Delta_{\min}} + SAH^4 (S \lor H)\right) \log K\right)$, where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Here, $\Delta_h(s,a) = V_h^*(s) - Q_h^*(s,a)$ represents the suboptimality gap and $\Delta_{\min} := \min_{\Delta_h(s,a)>0} \Delta_h(s,a)$. The term $\mathrm{Var}_{\max}^{c}$ denotes the maximum conditional total variance, calculated as the maximum over all $(\pi, h, s)$ tuples of the expected total variance under…
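To make the quantities in the abstract concrete, here is a minimal Python sketch that computes the suboptimality gaps and the minimum positive gap on a toy MDP by backward induction, then evaluates the two gap-dependent sums of the bound. Everything about the toy problem is an assumption made for illustration: the two-state, two-action MDP, its stationary transitions and rewards, the numerical value passed for the maximum conditional total variance, and the helper names. It ignores the constants and logarithmic factors hidden by the tilde-O notation as well as the lower-order S A H^4 (S v H) term, and it is not the MVP algorithm itself.

```python
import math


def optimal_values(P, R, H):
    """Backward induction for a finite-horizon tabular MDP.

    P[s][a] is a list of next-state probabilities and R[s][a] a reward.
    Returns (V, Q) with V[h][s], Q[h][s][a] for h = 0..H-1 and V[H][s] = 0.
    Assumption: transitions and rewards do not depend on the step h.
    """
    S, A = len(P), len(P[0])
    V = [[0.0] * S for _ in range(H + 1)]
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            for a in range(A):
                nxt = sum(p * V[h + 1][t] for t, p in enumerate(P[s][a]))
                Q[h][s][a] = R[s][a] + nxt
            V[h][s] = max(Q[h][s])
    return V, Q


def suboptimality_gaps(V, Q):
    """Delta_h(s, a) = V*_h(s) - Q*_h(s, a), which is always >= 0."""
    return [[[V[h][s] - q for q in Q[h][s]] for s in range(len(Q[h]))]
            for h in range(len(Q))]


def min_positive_gap(gaps, tol=1e-12):
    """Delta_min: the smallest strictly positive gap, or None if there is none."""
    positive = [g for layer in gaps for row in layer for g in row if g > tol]
    return min(positive) if positive else None


def gap_dependent_terms(gaps, H, K, var_c_max, tol=1e-12):
    """The two gap-dependent sums from the bound, without constants or log factors.

    var_c_max is supplied by the caller; this sketch does not compute it.
    """
    delta_min = min_positive_gap(gaps, tol)
    if delta_min is None:
        raise ValueError("Delta_min is undefined when every gap is zero")
    total = 0.0
    for layer in gaps:
        for row in layer:
            for g in row:
                if g > tol:   # suboptimal action: min(H^2 log K, Var) / Delta_h(s, a)
                    total += min(H * H * math.log(K), var_c_max) / g
                else:         # optimal action: min(H^2, Var) / Delta_min
                    total += min(H * H, var_c_max) / delta_min
    return total


# Hypothetical toy MDP: state 1 is absorbing; in state 0, action 1 moves to state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 0.5],
     [1.0, 0.0]]
H = 2

V, Q = optimal_values(P, R, H)
gaps = suboptimality_gaps(V, Q)
delta_min = min_positive_gap(gaps)
terms = gap_dependent_terms(gaps, H=H, K=100, var_c_max=0.25)
```

On this toy instance the gaps at the first step are [1.0, 0.0] in state 0 and [0.0, 1.0] in state 1, and the smallest positive gap is 0.5. Because the assumed variance value is far below H^2, both minima select the variance term, which is the regime where a variance-aware bound improves on one that depends only on H^2.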

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Risk and Portfolio Optimization