Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs
Shulun Chen, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du

TL;DR
This paper introduces a variance-aware, gap-dependent regret bound for episodic MDPs using the MVP algorithm, highlighting the importance of maximum conditional variance in learning efficiency and establishing a matching lower bound.
Contribution
It provides the first variance-aware gap-dependent regret bounds for episodic MDPs and introduces a novel analysis technique based on weighted suboptimality gaps.
Findings
The MVP algorithm achieves a gap-dependent regret bound that scales with the maximum conditional total variance.
A lower bound shows the necessity of variance dependence even with zero unconditional variance.
The analysis can be adapted to other algorithms for improved regret guarantees.
Abstract
We consider the gap-dependent regret bounds for episodic MDPs. We show that the Monotonic Value Propagation (MVP) algorithm achieves a variance-aware gap-dependent regret bound stated in terms of the planning horizon H, the number of states S, the number of actions A, the number of episodes K, the suboptimality gaps, and the maximum conditional total variance. The maximum conditional total variance is calculated as the maximum over all tuples of the expected total variance under…
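The suboptimality gap mentioned in the abstract can be made concrete with a small sketch. It assumes the standard definition, gap_h(s, a) = V*_h(s) - Q*_h(s, a), and computes it by backward induction on a toy tabular MDP. This is not the MVP algorithm, which must learn from samples. The function names `suboptimality_gaps` and `min_positive_gap` and the toy MDP are hypothetical, chosen only for illustration.

```python
import numpy as np

def suboptimality_gaps(P, R):
    """Backward induction on a known finite-horizon tabular MDP.

    P: array of shape (H, S, A, S), transition probabilities P[h, s, a, s'].
    R: array of shape (H, S, A), expected rewards.
    Returns (Q, V, gaps) with gaps[h, s, a] = V[h, s] - Q[h, s, a].
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))          # V[H] = 0 is the terminal value
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        # (S, A, S) @ (S,) -> (S, A): expected next-step optimal value
        Q[h] = R[h] + P[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)
    gaps = V[:H, :, None] - Q
    return Q, V, gaps

def min_positive_gap(gaps, tol=1e-12):
    """Smallest strictly positive gap, or None if every gap is zero."""
    positive = gaps[gaps > tol]
    return float(positive.min()) if positive.size else None

# Toy MDP (an assumption for illustration): H = 2, S = 2, A = 2.
# Action 0 stays in the current state, action 1 switches state.
# Reward 1 is paid only for taking action 0 in state 0.
P = np.zeros((2, 2, 2, 2))
for h in range(2):
    for s in range(2):
        P[h, s, 0, s] = 1.0
        P[h, s, 1, 1 - s] = 1.0
R = np.zeros((2, 2, 2))
R[:, 0, 0] = 1.0

Q, V, gaps = suboptimality_gaps(P, R)
print(gaps[0])                 # gaps at the first step
print(min_positive_gap(gaps))  # minimum positive gap
```

Optimal actions have zero gap, so every (h, s) pair contributes at least one zero entry. Gap-dependent bounds treat those zero-gap entries separately from the strictly positive ones.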
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Risk and Portfolio Optimization
