Variance-reduced $Q$-learning is minimax optimal

Martin J. Wainwright

arXiv:1906.04697·cs.LG·August 9, 2019·44 cites

Open Access

TL;DR

This paper presents a variance-reduced $Q$-learning algorithm that achieves near-optimal sample complexity for estimating the optimal $Q$-function in finite MDPs, matching minimax lower bounds up to a logarithmic factor.

Contribution

It introduces a variance-reduced $Q$-learning method whose sample complexity is provably minimax optimal up to a logarithmic factor, improving on the worst-case quartic scaling of ordinary $Q$-learning in the discount complexity.
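The summary above does not spell out the algorithm, so the following is only a simplified sketch of the general recentering idea behind variance-reduced $Q$-learning, assuming access to a generative model that can sample a next state for any state-action pair. Each epoch estimates the Bellman operator at a reference point `Q_bar` by Monte Carlo averaging, then runs stochastic updates whose noise is recentered around that estimate. The fixed epoch count, sample sizes, step size, toy MDP, and all function names are illustrative assumptions, not the paper's actual schedule or notation.

```python
import random

def empirical_bellman(r, P, gamma, rng):
    """Draw one next state per (state, action) pair and return the resulting
    one-sample empirical Bellman operator as a function of Q."""
    n_s, n_a = len(r), len(r[0])
    nxt = [[rng.choices(range(n_s), weights=P[x][u])[0] for u in range(n_a)]
           for x in range(n_s)]
    def apply(Q):
        return [[r[x][u] + gamma * max(Q[nxt[x][u]]) for u in range(n_a)]
                for x in range(n_s)]
    return apply

def vr_q_learning(r, P, gamma, epochs=5, recenter_n=2000, inner_k=2000, seed=0):
    """Simplified variance-reduced Q-learning (assumed fixed epoch sizes)."""
    rng = random.Random(seed)
    n_s, n_a = len(r), len(r[0])
    cells = [(x, u) for x in range(n_s) for u in range(n_a)]
    Q_bar = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(epochs):
        # Monte Carlo estimate of the Bellman operator at the reference point.
        T_bar = [[0.0] * n_a for _ in range(n_s)]
        for _ in range(recenter_n):
            TQ = empirical_bellman(r, P, gamma, rng)(Q_bar)
            for x, u in cells:
                T_bar[x][u] += TQ[x][u] / recenter_n
        # Recentered stochastic updates: the SAME samples are applied to Q and
        # to Q_bar, so their noise largely cancels once Q is close to Q_bar.
        Q = [row[:] for row in Q_bar]
        for k in range(1, inner_k + 1):
            lam = 1.0 / (1.0 + (1.0 - gamma) * k)  # assumed step-size choice
            T_hat = empirical_bellman(r, P, gamma, rng)
            TQ, TQ_bar = T_hat(Q), T_hat(Q_bar)
            Q = [[(1 - lam) * Q[x][u]
                  + lam * (TQ[x][u] - TQ_bar[x][u] + T_bar[x][u])
                  for u in range(n_a)] for x in range(n_s)]
        Q_bar = Q
    return Q_bar

def value_iteration(r, P, gamma, iters=200):
    """Exact value iteration with the true model, used only as a reference."""
    n_s, n_a = len(r), len(r[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        Q = [[r[x][u] + gamma * sum(p * max(Q[y]) for y, p in enumerate(P[x][u]))
              for u in range(n_a)] for x in range(n_s)]
    return Q

# Hypothetical 2-state, 2-action MDP: r[x][u] is the reward and P[x][u] is
# the next-state distribution.
r = [[1.0, 0.0], [0.0, 0.5]]
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
gamma = 0.5

Q_vr = vr_q_learning(r, P, gamma)
Q_star = value_iteration(r, P, gamma)
err = max(abs(Q_vr[x][u] - Q_star[x][u]) for x in range(2) for u in range(2))
print("sup-norm error:", err)
```

The design point worth noting is that `T_hat` is applied to both `Q` and `Q_bar` with the same sampled next states; using independent samples for the two terms would lose the cancellation that motivates recentering.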

Findings

01

Achieves $\tilde{O}\left(\frac{D}{\epsilon^{2} (1-\gamma)^{3}}\right)$ sample complexity.

02

Matches known minimax lower bounds up to a logarithmic factor.

03

Improves on ordinary $Q$-learning, whose worst-case sample complexity scales quartically in the discount complexity $1/(1-\gamma)$.

Abstract

We introduce and analyze a form of variance-reduced $Q$-learning. For $\gamma$-discounted MDPs with finite state space $X$ and action space $U$, we prove that it yields an $\epsilon$-accurate estimate of the optimal $Q$-function in the $\ell_{\infty}$-norm using $O\left(\left(\frac{D}{\epsilon^{2} (1-\gamma)^{3}}\right) \log\left(\frac{D}{1-\gamma}\right)\right)$ samples, where $D = |X| \times |U|$. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity. In contrast, our past work shows that ordinary $Q$-learning has worst-case quartic scaling in the discount complexity.
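To get a feel for the bound, the sketch below evaluates the expression $\frac{D}{\epsilon^{2}(1-\gamma)^{3}} \log\frac{D}{1-\gamma}$ numerically. The abstract states the result only up to an unspecified constant, so the parameter `c` is a hypothetical placeholder set to 1, and the outputs indicate scaling rather than actual sample counts. The helper `horizon_factor` is likewise an illustrative name for comparing cubic against quartic dependence on $1/(1-\gamma)$.

```python
import math

def vr_sample_bound(D, eps, gamma, c=1.0):
    """Evaluate c * D / (eps^2 (1-gamma)^3) * log(D / (1-gamma)).
    c stands in for the unspecified constant in the O(.) bound (assumed 1)."""
    return c * D / (eps ** 2 * (1.0 - gamma) ** 3) * math.log(D / (1.0 - gamma))

def horizon_factor(gamma, power):
    """Return (1 / (1 - gamma)) ** power, the discount-dependent factor."""
    return (1.0 / (1.0 - gamma)) ** power

# Illustrative values: D = |X| * |U| = 100 state-action pairs, accuracy 0.1.
for g in (0.9, 0.99):
    bound = vr_sample_bound(D=100, eps=0.1, gamma=g)
    # Quartic over cubic scaling differs by one extra factor of 1 / (1 - gamma).
    gap = horizon_factor(g, 4) / horizon_factor(g, 3)
    print(f"gamma={g}: bound ~ {bound:.3e}, quartic/cubic gap ~ {gap:.1f}")
```

The gap between quartic and cubic scaling is exactly one factor of $1/(1-\gamma)$, so it grows without bound as $\gamma$ approaches 1.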

Taxonomy

Topics: Machine Learning and Algorithms · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research