Sharp asymptotic theory for Q-learning with LDTZ learning rate and its generalization
Soham Bonnerjee, Zhipeng Lou, and Wei Biao Wu

TL;DR
This paper develops a theoretical framework for Q-learning with a class of decaying learning rates, including the recently popular LD2Z schedule, and provides sharp error bounds, a central limit theorem (CLT), and Gaussian approximation results.
Contribution
It introduces a unified analysis for power-law decay schedules in Q-learning, establishing their statistical properties and inference capabilities.
Findings
Sharp non-asymptotic error bounds for Q-learning with PD2Z-$\nu$.
Central limit theorem for tail Polyak-Ruppert averaging estimator.
Time-uniform Gaussian approximation for Q-learning iterates.
Abstract
Despite the sustained popularity of Q-learning as a practical tool for policy determination, a majority of relevant theoretical literature deals with either constant or polynomially decaying learning schedules. However, it is well known that these choices suffer from either persistent bias or prohibitively slow convergence. In contrast, the recently proposed linear decay to zero (LD2Z) schedule has shown appreciable empirical performance, but its theoretical and statistical properties remain largely unexplored, especially in the Q-learning setting. We address this gap in the literature by first considering a general class of power-law decay to zero (PD2Z-$\nu$) schedules. Proceeding step-by-step, we present a sharp non-asymptotic error bound for Q-learning with…
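To make the schedules named in the abstract concrete, here is a minimal Python sketch of tabular Q-learning with a power-law decay to zero step size and a tail average of the iterates. Everything in it is an illustrative assumption rather than the paper's construction: the schedule form `eta0 * (1 - t/n)**nu` (with `nu = 1` read as LD2Z), the synchronous generative-model sampling, the plain average over the final fraction of iterates, and the names `pd2z_rate`, `q_learning_pd2z`, `eta0`, and `tail_frac`.

```python
import numpy as np

def pd2z_rate(t, n, eta0=0.5, nu=1.0):
    """Assumed PD2Z-nu step size: eta0 * (1 - t/n)**nu for t = 0, ..., n-1.
    With nu = 1 this is a linear decay to zero (LD2Z)."""
    return eta0 * (1.0 - t / n) ** nu

def q_learning_pd2z(P, R, gamma, n, nu=1.0, eta0=0.5, tail_frac=0.5, seed=0):
    """Synchronous tabular Q-learning (hypothetical setup, not the paper's).

    P: transition probabilities, shape (S, A, S)
    R: deterministic rewards, shape (S, A)
    Returns the last iterate and the average of the final tail_frac of iterates.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    tail_sum = np.zeros((S, A))
    tail_start = int((1.0 - tail_frac) * n)
    for t in range(n):
        eta = pd2z_rate(t, n, eta0, nu)
        # Draw one next state for every (state, action) pair.
        next_states = np.array(
            [[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)]
        )
        # Empirical Bellman target: reward plus discounted greedy value.
        target = R + gamma * Q[next_states].max(axis=-1)
        Q = (1.0 - eta) * Q + eta * target
        if t >= tail_start:
            tail_sum += Q
    return Q, tail_sum / (n - tail_start)

if __name__ == "__main__":
    # One-state, one-action toy problem whose fixed point is 1 / (1 - 0.5) = 2.
    P = np.ones((1, 1, 1))
    R = np.ones((1, 1))
    last, tail_avg = q_learning_pd2z(P, R, gamma=0.5, n=2000)
    print(last[0, 0], tail_avg[0, 0])
```

Under this assumed form the step size starts at `eta0` and reaches zero at the end of the horizon `n`, so, unlike a constant or polynomially decaying rate, the schedule depends on the total number of iterations fixed in advance.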
