A Statistical Analysis of Polyak-Ruppert Averaged Q-learning
Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, Michael I. Jordan

TL;DR
This paper provides a detailed statistical analysis of Polyak-Ruppert averaged Q-learning, establishing a functional central limit theorem, online inference methods, and optimal error bounds in tabular Markov decision processes.
Contribution
The paper establishes a functional CLT for averaged Q-learning, shows that the averaged iterate is asymptotically efficient as a regular asymptotically linear (RAL) estimator, and derives nonasymptotic error bounds, with extensions to entropy-regularized Q-learning.
Findings
Functional CLT for averaged Q-learning process
Online inference method based on the CLT
Nonasymptotic error bounds that match the instance-dependent lower bound for polynomial step sizes
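The functional CLT in the first finding can be written schematically as follows. The notation is assumed for illustration and is not taken from this page: Q_t is the t-th Q-learning iterate, Q^* the optimal Q-value function, B a standard Brownian motion of matching dimension, and \Sigma a limiting covariance matrix whose exact form is given in the paper.

```latex
\phi_T(r) \;=\; \frac{1}{\sqrt{T}} \sum_{t=1}^{\lfloor T r \rfloor} \bigl( Q_t - Q^{*} \bigr), \quad r \in [0,1],
\qquad
\phi_T(\cdot) \;\Rightarrow\; \Sigma^{1/2} B(\cdot) \quad \text{as } T \to \infty .
```

At r = 1 the process equals \sqrt{T}(\bar{Q}_T - Q^*), with \bar{Q}_T the average of the first T iterates, so the ordinary CLT for the averaged iterate is the special case of a single time point.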
Abstract
We study Q-learning with Polyak-Ruppert averaging in a discounted Markov decision process in synchronous and tabular settings. Under a Lipschitz condition, we establish a functional central limit theorem for the averaged iterate and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The functional central limit theorem implies a fully online inference method for reinforcement learning. Furthermore, we show that the averaged iterate is the regular asymptotically linear (RAL) estimator for the optimal Q-value function that has the most efficient influence function. We present a nonasymptotic analysis for the error of the averaged iterate, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results…
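As a rough illustration of the setting in the abstract, here is a minimal sketch of synchronous, tabular Q-learning with Polyak-Ruppert averaging. It is not the authors' code. The function names, the toy MDP, deterministic rewards, and the step size eta_t = t^(-alpha) with alpha = 0.75 are all assumptions chosen for the example. "Synchronous" here means that every state-action pair receives one fresh sampled transition per iteration.

```python
import random

def optimal_q(P, R, gamma, iters=1000):
    """Reference Q* via value iteration on the known model (for checking only)."""
    S, A = len(R), len(R[0])
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(A)] for s in range(S)]
    return Q

def averaged_q_learning(P, R, gamma, T, alpha=0.75, seed=0):
    """Synchronous tabular Q-learning; returns (last iterate, averaged iterate).

    P[s][a] is a list of next-state probabilities, R[s][a] a deterministic reward.
    """
    rng = random.Random(seed)
    S, A = len(R), len(R[0])
    Q = [[0.0] * A for _ in range(S)]
    Q_bar = [[0.0] * A for _ in range(S)]
    for t in range(1, T + 1):
        eta = t ** (-alpha)              # polynomial step size (assumed form)
        V = [max(row) for row in Q]      # greedy values of the previous iterate
        for s in range(S):
            for a in range(A):
                # One sampled transition per (s, a) pair in every iteration.
                s_next = rng.choices(range(S), weights=P[s][a])[0]
                target = R[s][a] + gamma * V[s_next]
                Q[s][a] = (1 - eta) * Q[s][a] + eta * target
                # Running mean of Q_1, ..., Q_t (Polyak-Ruppert average).
                Q_bar[s][a] += (Q[s][a] - Q_bar[s][a]) / t
    return Q, Q_bar

# Toy 2-state, 2-action MDP (hypothetical numbers).
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.5, 0.2]]
Q_last, Q_bar = averaged_q_learning(P, R, gamma=0.5, T=20000)
Q_star = optimal_q(P, R, gamma=0.5)
err = max(abs(Q_bar[s][a] - Q_star[s][a]) for s in range(2) for a in range(2))
print("sup-norm error of averaged iterate:", err)
```

The running-mean update avoids storing the whole trajectory, which is what makes the averaged estimator usable in a fully online fashion.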
Taxonomy
Topics: Control Systems and Identification · Reinforcement Learning in Robotics · Receptor Mechanisms and Signaling
Methods: Q-Learning
