Variance-reduced $Q$-learning is minimax optimal
Martin J. Wainwright

TL;DR
This paper presents a variance-reduced $Q$-learning algorithm that achieves near-optimal sample complexity for estimating the optimal $Q$-function in finite MDPs, matching minimax lower bounds up to a logarithmic factor.
Contribution
It introduces a variance-reduced $Q$-learning method whose sample complexity is provably minimax optimal up to a logarithmic factor, improving upon the worst-case quartic scaling of ordinary $Q$-learning in the discount complexity.
Findings
Achieves $\tilde{O}\left(\frac{D}{\epsilon^2 (1-\gamma)^3}\right)$ sample complexity.
Matches known minimax lower bounds up to a logarithmic factor.
Improves on ordinary $Q$-learning, whose worst-case sample complexity scales quartically in the discount complexity.
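To make the gap between cubic and quartic scaling concrete, here is a small arithmetic sketch. It evaluates only the leading-order term $D / (\epsilon^2 (1-\gamma)^p)$ for $p = 3$ and $p = 4$, dropping constants and logarithmic factors. The function name `leading_term`, the example values of `D`, `eps`, and `gamma`, and the choice to hold the $D/\epsilon^2$ prefactor fixed for the quartic case are assumptions made purely for illustration.

```python
import math


def leading_term(num_pairs: int, epsilon: float, gamma: float, power: int) -> float:
    """Leading-order term D / (eps^2 * (1 - gamma)^power).

    Constants and logarithmic factors are deliberately ignored, so this is
    a scaling illustration, not a sample-size calculator.
    """
    return num_pairs / (epsilon ** 2 * (1.0 - gamma) ** power)


# Illustrative values (assumptions, not taken from the paper).
D, eps, gamma = 100, 0.1, 0.99

cubic = leading_term(D, eps, gamma, power=3)    # variance-reduced scaling
quartic = leading_term(D, eps, gamma, power=4)  # worst-case ordinary scaling

# The ratio is exactly one extra factor of the discount complexity 1/(1-gamma).
ratio = quartic / cubic
print(f"cubic ~ {cubic:.3g}, quartic ~ {quartic:.3g}, ratio ~ {ratio:.3g}")
```

Under these assumptions the two terms differ by one factor of $1/(1-\gamma)$, so the gap widens as the discount factor approaches 1.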
Abstract
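We introduce and analyze a form of variance-reduced $Q$-learning. For $\gamma$-discounted MDPs with finite state space $\mathcal{X}$ and action space $\mathcal{U}$, we prove that it yields an $\epsilon$-accurate estimate of the optimal $Q$-function in the $\ell_\infty$-norm using $\mathcal{O}\left(\left(\frac{D}{\epsilon^2 (1-\gamma)^3}\right) \log\left(\frac{D}{1-\gamma}\right)\right)$ samples, where $D = |\mathcal{X}| \times |\mathcal{U}|$. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity $1/(1-\gamma)$. In contrast, our past work shows that ordinary $Q$-learning has worst-case quartic scaling in the discount complexity.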
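The abstract does not spell out the update rule, so the following is only a minimal sketch of one common form of variance-reduced $Q$-learning, not necessarily the paper's exact procedure. It assumes a generative (synchronous) sampling model with deterministic rewards: each epoch forms a Monte Carlo estimate of the Bellman operator at a reference point `Q_bar`, then runs recentered stochastic updates in which the same sample is used at the current iterate and at `Q_bar`. The toy MDP, the step size `1 / (1 + (1 - gamma) * k)`, the epoch count, and the sample sizes are all illustrative assumptions, as are the helper names.

```python
import numpy as np


def bellman_exact(Q, P, r, gamma):
    """Exact Bellman optimality operator; P has shape (S, A, S), r has shape (S, A)."""
    return r + gamma * P @ Q.max(axis=1)


def sample_next_states(P, rng, n):
    """Draw n next states for every (state, action) pair; returns shape (n, S, A)."""
    S, A, _ = P.shape
    cdf = P.cumsum(axis=2)
    u = rng.random((n, S, A, 1))
    idx = (u > cdf[None]).sum(axis=3)
    return np.minimum(idx, S - 1)  # guard against floating-point round-off in the CDF


def empirical_bellman(Q, r, gamma, next_states):
    """Sample-average Bellman operator built from the drawn next states."""
    return r + gamma * Q.max(axis=1)[next_states].mean(axis=0)


def vr_q_learning(P, r, gamma, epochs, recenter_samples, steps, rng):
    """Epoch-based variance-reduced Q-learning (illustrative sketch)."""
    Q_bar = np.zeros_like(r)
    for _ in range(epochs):
        # Monte Carlo estimate of the Bellman operator at the reference point.
        T_bar = empirical_bellman(
            Q_bar, r, gamma, sample_next_states(P, rng, recenter_samples)
        )
        Q = Q_bar.copy()
        for k in range(1, steps + 1):
            lam = 1.0 / (1.0 + (1.0 - gamma) * k)  # assumed step-size schedule
            nxt = sample_next_states(P, rng, 1)
            # Same sample at Q and Q_bar, so the noise shrinks as Q nears Q_bar.
            recentered = (
                empirical_bellman(Q, r, gamma, nxt)
                - empirical_bellman(Q_bar, r, gamma, nxt)
                + T_bar
            )
            Q = (1.0 - lam) * Q + lam * recentered
        Q_bar = Q
    return Q_bar


# Toy MDP (an assumption for illustration): 5 states, 2 actions.
rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.8
P = rng.dirichlet(np.ones(S), size=(S, A))  # transition kernel, shape (S, A, S)
r = rng.random((S, A))                      # deterministic rewards in [0, 1]

# Reference solution by value iteration on the known model.
Q_star = np.zeros((S, A))
for _ in range(500):
    Q_star = bellman_exact(Q_star, P, r, gamma)

Q_hat = vr_q_learning(P, r, gamma, epochs=6, recenter_samples=2000, steps=500, rng=rng)
err = np.abs(Q_hat - Q_star).max()  # error in the sup-norm
print(f"sup-norm error: {err:.4f}")
```

The design point this sketch is meant to show is the recentering: the difference of the two empirical operators has variance proportional to the distance between the iterate and `Q_bar`, so later epochs run with much less noise than ordinary $Q$-learning would.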
Taxonomy
Topics: Machine Learning and Algorithms · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research
