On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

Huizhen Yu

arXiv:1712.09652·cs.LG·March 30, 2018·25 cites

Open Access

TL;DR

This paper provides comprehensive convergence analysis for various gradient-based off-policy TD algorithms with linear function approximation, covering different stepsize strategies and lambda-parameter schemes.

Contribution

It introduces new convergence results for multiple variants of gradient-based TD algorithms, including their robustified, mirror-descent, and single-time-scale forms, under diverse conditions.

Findings

01. Convergence is established for constant, slowly diminishing, and standard square-summable stepsizes.

02. The analysis covers state-dependent, history-dependent, and combined lambda schemes.

03. Results include almost sure convergence and characterizations of asymptotic behavior.
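
To make the stepsize families in the findings above concrete, here is a minimal sketch of one common parametrization. The function name `power_stepsize` and the power-law form are assumptions chosen for illustration, not definitions taken from the paper. With `power = 0` the stepsize is constant; with `0 < power <= 0.5` it diminishes but its squares are not summable, which is one possible reading of "slowly diminishing"; with `0.5 < power <= 1` it satisfies the usual conditions that the stepsizes sum to infinity while their squares sum to a finite value.

```python
def power_stepsize(t: int, alpha0: float = 0.1, power: float = 0.0) -> float:
    """Hypothetical schedule alpha_t = alpha0 / (t + 1) ** power.

    power == 0         -> constant stepsize alpha0
    0 < power <= 0.5   -> diminishing, but sum of squares diverges
    0.5 < power <= 1   -> sum diverges, sum of squares converges
    """
    if t < 0 or power < 0:
        raise ValueError("t and power must be non-negative")
    return alpha0 / (t + 1) ** power


# Illustrative usage: the first few stepsizes of each family.
constant = [power_stepsize(t) for t in range(3)]
slow = [power_stepsize(t, power=0.5) for t in range(3)]
square_summable = [power_stepsize(t, power=1.0) for t in range(3)]
```

The exact conditions the paper imposes on each stepsize type may differ from this parametrization; the sketch only shows the qualitative distinction.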

Abstract

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several gradient-based TD algorithms with linear function approximation. The algorithms we analyze include: (i) two basic forms of two-time-scale gradient-based TD algorithms, which we call GTD and which minimize the mean squared projected Bellman error using stochastic gradient-descent; (ii) their "robustified" biased variants; (iii) their mirror-descent versions which combine the mirror-descent idea with TD learning; and (iv) a single-time-scale version of GTD that solves minimax problems formulated for approximate policy evaluation. We derive convergence results for three types of stepsizes: constant stepsize, slowly diminishing stepsize, as well as the standard…
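
As a rough illustration of what a two-time-scale gradient-based TD update with linear function approximation can look like, the sketch below implements a GTD2-style step from the wider literature, with an importance-sampling ratio for off-policy data and no eligibility traces (lambda = 0). It is not the paper's exact algorithm: the function name `gtd2_step`, the argument names, and the restriction to lambda = 0 are all assumptions made for brevity.

```python
import numpy as np


def gtd2_step(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One GTD2-style update (illustrative sketch, lambda = 0).

    theta    : value-function weights, updated on the slow time scale (alpha)
    w        : auxiliary weights, updated on the fast time scale (beta)
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio target_prob / behavior_prob
    """
    # Temporal-difference error under the current value estimate.
    delta = reward + gamma * (phi_next @ theta) - (phi @ theta)
    # Slow update: stochastic gradient step on the projected Bellman error.
    theta_new = theta + alpha * rho * (phi - gamma * phi_next) * (phi @ w)
    # Fast update: w tracks a least-squares estimate of the expected TD error.
    w_new = w + beta * (rho * delta - (phi @ w)) * phi
    return theta_new, w_new


# Illustrative usage on a single repeated transition with one-hot features.
theta, w = np.zeros(2), np.zeros(2)
phi, phi_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for _ in range(2):
    theta, w = gtd2_step(theta, w, phi, phi_next, reward=1.0, rho=1.0,
                         gamma=0.9, alpha=0.1, beta=0.5)
```

Choosing `beta` larger than `alpha` mimics the two-time-scale structure, in which the auxiliary weights adapt faster than the value-function weights. The robustified, mirror-descent, and single-time-scale variants mentioned in the abstract modify this basic scheme and are not shown here.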

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems