Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms

Tengyu Xu; Yingbin Liang

arXiv:2011.05053·cs.LG·November 11, 2020·5 cites

TL;DR

This paper analyzes the non-asymptotic convergence of two timescale value-based reinforcement learning algorithms (linear TDC, nonlinear TDC, and Greedy-GQ) under Markovian sampling with an accuracy-independent constant stepsize. It establishes sample complexity bounds for all three and gives the first non-asymptotic convergence result for nonlinear TDC.

Contribution

It introduces a novel non-asymptotic analysis of linear TDC, nonlinear TDC, and Greedy-GQ under Markovian sampling with a constant stepsize, and establishes a sample complexity bound for each algorithm.

Findings

01

Linear TDC achieves $\mathcal{O}\left(\frac{1}{\epsilon}\log\frac{1}{\epsilon}\right)$ sample complexity.

02

Nonlinear TDC and Greedy-GQ attain $\mathcal{O}\left(\frac{1}{\epsilon^2}\right)$ sample complexity.

03

First non-asymptotic convergence result for nonlinear TDC under Markovian sampling.
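
To get a sense of scale for the two rates listed in the findings, the short sketch below evaluates them for a few target accuracies. It is illustrative only: the hidden constants in the big-O bounds are assumed to be 1, the logarithm is taken as natural, and the function names are hypothetical. The two expressions apply to different algorithms, so the numbers are not a head-to-head comparison.

```python
import math


def rate_linear_tdc(eps: float) -> float:
    """(1/eps) * log(1/eps), with the big-O constant assumed to be 1."""
    return (1.0 / eps) * math.log(1.0 / eps)


def rate_nonlinear(eps: float) -> float:
    """1/eps^2, with the big-O constant assumed to be 1."""
    return 1.0 / eps**2


if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3):
        print(f"eps={eps:g}  "
              f"(1/eps)log(1/eps)={rate_linear_tdc(eps):.1f}  "
              f"1/eps^2={rate_nonlinear(eps):.1f}")
```

Under these assumptions, the gap between the two expressions widens as the target accuracy shrinks.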

Abstract

Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms. In the policy evaluation setting, it can model the linear and nonlinear temporal difference learning with gradient correction (TDC) algorithms as linear SA and nonlinear SA, respectively. In the policy optimization setting, two timescale nonlinear SA can also model the greedy gradient-Q (Greedy-GQ) algorithm. In previous studies, the non-asymptotic analysis of linear TDC and Greedy-GQ has been studied in the Markovian setting, with diminishing or accuracy-dependent stepsize. For the nonlinear TDC algorithm, only the asymptotic convergence has been established. In this paper, we study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ under Markovian sampling and with accuracy-independent constant stepsize. For linear TDC, we provide…
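
As a rough illustration of the setting the abstract describes, the following is a minimal sketch of linear TDC run with two constant stepsizes on a single Markovian trajectory. It uses the standard on-policy TDC update, not code from the paper. The function name `linear_tdc`, the three-state chain, the tabular features, and the stepsize values are all assumptions chosen for the example.

```python
import numpy as np


def linear_tdc(P, r, Phi, gamma=0.5, alpha=0.02, beta=0.1, steps=20_000, seed=0):
    """On-policy linear TDC along one trajectory of the Markov chain P.

    theta is the main (value) iterate updated with stepsize alpha;
    w is the auxiliary gradient-correction iterate updated with stepsize beta.
    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    theta = np.zeros(d)
    w = np.zeros(d)
    s = 0
    for _ in range(steps):
        # Markovian sampling: the next state depends on the current one.
        s_next = rng.choice(n, p=P[s])
        phi, phi_next = Phi[s], Phi[s_next]
        # Temporal-difference error under the current theta.
        delta = r[s] + gamma * (phi_next @ theta) - phi @ theta
        # Both updates use the iterates from before this step.
        theta_new = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))
        w = w + beta * (delta - phi @ w) * phi
        theta = theta_new
        s = s_next
    return theta, w


if __name__ == "__main__":
    # Hypothetical 3-state chain with tabular (one-hot) features.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.25, 0.5, 0.25],
                  [0.0, 0.5, 0.5]])
    r = np.array([1.0, 0.0, 2.0])
    Phi = np.eye(3)
    theta, _ = linear_tdc(P, r, Phi)
    # With tabular features the TD fixed point is the true value function.
    v_true = np.linalg.solve(np.eye(3) - 0.5 * P, r)
    print("TDC estimate:", theta)
    print("true values: ", v_true)
```

With a constant stepsize the iterate is expected to settle in a neighborhood of the fixed point rather than converge exactly, so the printed estimate should be close to, but not identical to, the true values.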

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research