Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms
Tengyu Xu, Yingbin Liang

TL;DR
This paper provides a non-asymptotic convergence analysis of two timescale value-based reinforcement learning algorithms under Markovian sampling with an accuracy-independent constant stepsize, establishing sample complexity bounds for linear TDC, nonlinear TDC, and Greedy-GQ.
Contribution
It introduces a novel non-asymptotic analysis under Markovian sampling with an accuracy-independent constant stepsize and establishes sample complexity bounds for linear TDC, nonlinear TDC, and Greedy-GQ.
Findings
Linear TDC achieves $\mathcal{O}\left(\frac{1}{\epsilon}\log\frac{1}{\epsilon}\right)$ sample complexity.
Nonlinear TDC and Greedy-GQ attain $\mathcal{O}\left(\frac{1}{\epsilon^2}\right)$ sample complexity.
First non-asymptotic convergence result for nonlinear TDC under Markovian sampling.
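To get a feel for how the two rates in the findings scale, the following minimal sketch evaluates them for a few target accuracies. It drops all constants hidden by the big-O notation, so the numbers indicate growth only, not actual sample counts. The function names are hypothetical, and the accuracy measure behind each bound may differ in the paper, so this is a comparison of growth rates rather than of the algorithms themselves.

```python
import math

def linear_tdc_rate(eps: float) -> float:
    """Growth of (1/eps) * log(1/eps), with big-O constants dropped."""
    return (1.0 / eps) * math.log(1.0 / eps)

def nonlinear_rate(eps: float) -> float:
    """Growth of 1/eps^2, with big-O constants dropped."""
    return 1.0 / eps ** 2

# Compare the two growth rates as the target accuracy eps shrinks.
for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps={eps:g}  (1/eps)log(1/eps)={linear_tdc_rate(eps):.1f}  "
          f"1/eps^2={nonlinear_rate(eps):.1f}")
```

As eps shrinks, the gap between the two rates widens by roughly a factor of 1/(eps log(1/eps)).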
Abstract
Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms. In the policy evaluation setting, it can model the linear and nonlinear temporal difference learning with gradient correction (TDC) algorithms as linear SA and nonlinear SA, respectively. In the policy optimization setting, two timescale nonlinear SA can also model the greedy gradient-Q (Greedy-GQ) algorithm. In previous studies, the non-asymptotic analysis of linear TDC and Greedy-GQ has been studied in the Markovian setting, with diminishing or accuracy-dependent stepsize. For the nonlinear TDC algorithm, only the asymptotic convergence has been established. In this paper, we study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ under Markovian sampling and with accuracy-independent constant stepsize. For linear TDC, we provide…
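As an illustration of the setting the abstract describes, here is a minimal sketch of linear TDC run along a single Markovian trajectory with constant stepsizes. It uses the commonly cited on-policy form of the TDC update, which is an assumption and not necessarily the exact variant analyzed in the paper. The three-state chain, rewards, features, stepsizes `alpha` and `beta`, and the function name `linear_tdc` are all made up for the example.

```python
import random

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def linear_tdc(P, r, Phi, gamma=0.9, alpha=0.01, beta=0.05, steps=5000, seed=0):
    """Linear TDC on one trajectory of a Markov reward process.

    P: transition matrix, r: per-state rewards, Phi: per-state feature vectors.
    alpha (slow, main iterate) and beta (fast, auxiliary iterate) are constant
    stepsizes; their values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n, d = len(Phi), len(Phi[0])
    theta = [0.0] * d   # value-function weights (slow timescale)
    w = [0.0] * d       # gradient-correction weights (fast timescale)
    s = 0
    for _ in range(steps):
        # Markovian sampling: the next state depends on the current one.
        s_next = rng.choices(range(n), weights=P[s])[0]
        phi, phi_next = Phi[s], Phi[s_next]
        # TD error under the current value estimate.
        delta = r[s] + gamma * dot(theta, phi_next) - dot(theta, phi)
        corr = dot(phi, w)
        # Slow update: TD step plus the gradient-correction term.
        theta = [th + alpha * (delta * p - gamma * corr * pn)
                 for th, p, pn in zip(theta, phi, phi_next)]
        # Fast update: track the projection of the TD error onto the features.
        w = [wi + beta * (delta - corr) * p for wi, p in zip(w, phi)]
        s = s_next
    return theta, w

# Hypothetical 3-state chain with 2-dimensional features.
P = [[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.4, 0.0, 0.6]]
r = [1.0, 0.0, -1.0]
Phi = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
theta, w = linear_tdc(P, r, Phi)
print("theta:", theta)
print("w:", w)
```

Both iterates are updated from the same sample at every step, and the two timescales come only from the different stepsizes `alpha` and `beta`.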
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research
