Gap-Dependent Bounds for Q-Learning using Reference-Advantage Decomposition

Zhong Zheng; Haochen Zhang; Lingzhou Xue

arXiv:2410.07574·stat.ML·March 11, 2025

TL;DR

This paper establishes new gap-dependent regret bounds for Q-learning algorithms that use variance estimators and reference-advantage decomposition, showing that benign MDP structure, such as a strictly positive suboptimality gap, yields improved regret.

Contribution

The paper introduces a novel error decomposition framework for the gap-dependent analysis of Q-learning with variance-based bonuses and reference-advantage decomposition, a setting in which gap-dependent regret bounds were previously an open question.

Findings

01. Regret bounds logarithmic in $T$ for UCB-Advantage and Q-EarlySettled-Advantage

02. First gap-dependent analysis for Q-learning with variance estimators

03. Improved bounds on policy switching costs in structured MDPs

Abstract

We study the gap-dependent bounds of two important algorithms for on-policy Q-learning for finite-horizon episodic tabular Markov Decision Processes (MDPs): UCB-Advantage (Zhang et al. 2020) and Q-EarlySettled-Advantage (Li et al. 2021). UCB-Advantage and Q-EarlySettled-Advantage improve upon the results based on Hoeffding-type bonuses and achieve the almost optimal $\sqrt{T}$-type regret bound in the worst-case scenario, where $T$ is the total number of steps. However, benign structures of the MDP, such as a strictly positive suboptimality gap, can significantly improve the regret. While gap-dependent regret bounds have been obtained for Q-learning with Hoeffding-type bonuses, it remains an open question to establish gap-dependent regret bounds for Q-learning using variance estimators in their bonuses and reference-advantage decomposition for variance reduction. We develop a novel…
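The abstract refers to a "strictly positive suboptimality gap" without defining it on this page. As a rough illustration only, the sketch below assumes the definition commonly used in the gap-dependent literature, $\Delta_h(s,a) = V_h^*(s) - Q_h^*(s,a)$, and computes it by backward induction on a small finite-horizon tabular MDP. The function names (`suboptimality_gaps`, `min_positive_gap`) and the toy MDP are hypothetical and are not taken from the paper; this is not an implementation of UCB-Advantage or Q-EarlySettled-Advantage.

```python
import numpy as np

def suboptimality_gaps(P, R):
    """Backward induction on a known finite-horizon tabular MDP.

    P: array of shape (H, S, A, S), P[h, s, a, s2] = transition probability.
    R: array of shape (H, S, A), deterministic rewards.
    Returns (Q, V, gaps) where gaps[h, s, a] = V[h, s] - Q[h, s, a].
    The gap definition is an assumption (standard in the literature),
    not something stated on this page.
    """
    H, S, A = R.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))  # V[H] = 0 is the terminal value
    for h in range(H - 1, -1, -1):
        # Expected next-step value: (S, A, S) @ (S,) -> (S, A)
        Q[h] = R[h] + P[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)
    gaps = V[:H, :, None] - Q
    return Q, V[:H], gaps

def min_positive_gap(gaps, tol=1e-12):
    """Smallest strictly positive gap, or inf if every action is optimal."""
    positive = gaps[gaps > tol]
    return float(positive.min()) if positive.size else float("inf")

# Hypothetical toy MDP: H = 2 steps, S = 2 states, A = 2 actions.
# Action 0 always moves to state 1; action 1 always moves to state 0.
P = np.zeros((2, 2, 2, 2))
P[:, :, 0, 1] = 1.0
P[:, :, 1, 0] = 1.0
R = np.array([[[0.0, 0.5], [1.0, 0.0]],
              [[0.0, 0.5], [1.0, 0.0]]])

Q, V, gaps = suboptimality_gaps(P, R)
delta_min = min_positive_gap(gaps)
```

In this toy example the optimal actions have gap zero and the smallest positive gap is 0.5. Gap-dependent bounds of the kind the abstract describes are typically stated in terms of such a minimum gap, whereas worst-case bounds do not use it.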

Taxonomy

Methods: Q-Learning