A Multistep Lyapunov Approach for Finite-Time Analysis of Biased Stochastic Approximation
Gang Wang, Bingcong Li, Georgios B. Giannakis

TL;DR
This paper introduces a multistep Lyapunov approach for finite-time analysis of biased stochastic approximation algorithms, providing the first non-asymptotic error bounds for unmodified TD and Q-learning with linear function approximation under general conditions.
Contribution
The paper develops a multistep Lyapunov framework that makes finite-time analysis of biased stochastic approximation algorithms in reinforcement learning possible.
Findings
First finite-time error bounds for unmodified TD and Q-learning.
Applicable under general Markov chain mixing conditions.
Holds for Markov chains started from any initial distribution, and extends to nonlinear function approximators.
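The findings mention Markov chain mixing and arbitrary initial distributions. As a toy illustration only, the sketch below uses a hypothetical 3-state chain with strictly positive transition probabilities, which mixes geometrically. The paper's own "ergodic-like" assumption is more general than this special case. The sketch starts the chain from a point mass and tracks the total-variation distance between the law of the state after k steps and the stationary distribution. The matrix P, the starting distribution mu and the helper names are assumptions made for this example.

```python
# Illustrative sketch (not from the paper): how fast a finite, ergodic
# Markov chain forgets an arbitrary initial distribution.

def step(dist, P):
    """One transition of the chain: returns the row vector dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, iters=10_000):
    """Approximate the stationary distribution by power iteration."""
    dist = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        dist = step(dist, P)
    return dist

def tv_distance(p, q):
    """Total-variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical 3-state transition matrix (each row sums to 1).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

pi = stationary(P)
mu = [1.0, 0.0, 0.0]  # arbitrary start: point mass on state 0
gaps = []
for k in range(10):
    gaps.append(tv_distance(mu, pi))  # distance of the law at step k from pi
    mu = step(mu, P)
print([round(g, 6) for g in gaps])
```

Any other starting vector mu gives the same qualitative picture. For this particular P, every pair of rows overlaps in at least 0.7 of their mass, so each step shrinks the distance by a factor of at least 0.3.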
Abstract
Motivated by the widespread use of temporal-difference (TD-) and Q-learning algorithms in reinforcement learning, this paper studies a class of biased stochastic approximation (SA) procedures under a mild "ergodic-like" assumption on the underlying stochastic noise sequence. Building upon a carefully designed multistep Lyapunov function that looks ahead to several future updates to accommodate the stochastic perturbations (for control of the gradient bias), we prove a general result on the convergence of the iterates, and use it to derive non-asymptotic bounds on the mean-square error in the case of constant stepsizes. This novel looking-ahead viewpoint renders finite-time analysis of biased SA algorithms under a large family of stochastic perturbations possible. For direct comparison with existing contributions, we also demonstrate these bounds by applying them to TD- and Q-learning…
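The sketch below makes the abstract's setting concrete. It runs TD(0) with linear function approximation and a constant stepsize, and it is unmodified in the sense that it has no projection step. Its samples come from a single Markov trajectory, so consecutive updates are correlated and the update direction is a biased estimate of its stationary mean. The chain, rewards, features, discount, stepsize and seed are all hypothetical. The reference point θ* is the standard TD fixed point solving A θ* = b, where A = Φᵀ D (I − γP) Φ, b = Φᵀ D r, and D = diag(π). The sketch only shows the quantity the mean-square bounds are about. It does not reproduce the paper's multistep Lyapunov argument or its constants.

```python
import random

GAMMA = 0.9                   # discount factor (hypothetical)
P = [[0.5, 0.3, 0.2],         # transition matrix of the sampled chain (hypothetical)
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
R = [1.0, 0.0, -1.0]          # reward collected in each state (hypothetical)
PHI = [[1.0, 0.0],            # feature vector phi(s) of each state (hypothetical)
       [0.0, 1.0],
       [1.0, 1.0]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def stationary(P, iters=10_000):
    """Stationary distribution pi of P, by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def td_fixed_point():
    """Solve A theta* = b with A = Phi^T D (I - gamma P) Phi, b = Phi^T D r."""
    pi = stationary(P)
    n, d = len(P), len(PHI[0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for s in range(n):
        # expected next-state features E[phi(s') | s]
        next_phi = [sum(P[s][t] * PHI[t][k] for t in range(n)) for k in range(d)]
        diff = [PHI[s][k] - GAMMA * next_phi[k] for k in range(d)]
        for i in range(d):
            b[i] += pi[s] * PHI[s][i] * R[s]
            for j in range(d):
                A[i][j] += pi[s] * PHI[s][i] * diff[j]
    # 2x2 solve by Cramer's rule (d == 2 in this toy example)
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def sample_next(s, rng):
    """Draw s' ~ P[s] by inverting the cumulative row sums."""
    u, acc = rng.random(), 0.0
    for t, p in enumerate(P[s]):
        acc += p
        if u < acc:
            return t
    return len(P) - 1  # guard against floating-point round-off

def run_td0(alpha=0.005, steps=100_000, seed=0):
    """Unmodified TD(0): no projection, constant stepsize, one Markov trajectory."""
    rng = random.Random(seed)
    theta_star = td_fixed_point()
    theta = [0.0, 0.0]
    s = 0
    sq_errors = []
    for _ in range(steps):
        s_next = sample_next(s, rng)
        # TD error built from the correlated consecutive pair (s, s_next)
        delta = R[s] + GAMMA * dot(PHI[s_next], theta) - dot(PHI[s], theta)
        theta = [th + alpha * delta * f for th, f in zip(theta, PHI[s])]
        sq_errors.append(sum((a - c) ** 2 for a, c in zip(theta, theta_star)))
        s = s_next
    tail = sq_errors[steps // 2:]  # discard the transient, average the rest
    return theta_star, sum(tail) / len(tail)

theta_star, tail_mse = run_td0()
print("theta* =", [round(x, 4) for x in theta_star])
print("empirical squared error over last half:", round(tail_mse, 4))
```

With a constant stepsize the iterates do not converge to θ*. They keep fluctuating in a neighborhood of it. Reducing alpha should shrink the reported tail error roughly in proportion, but it also lengthens the transient. This trade-off is the kind that constant-stepsize mean-square bounds quantify. The sketch does not check the paper's specific rates.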
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Neural dynamics and brain function
Methods: Q-Learning
