On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning
Huizhen Yu

TL;DR
This paper provides comprehensive convergence analysis for various gradient-based off-policy TD algorithms with linear function approximation, covering different stepsize strategies and lambda-parameter schemes.
Contribution
It introduces new convergence results for multiple variants of gradient-based TD algorithms, including their robustified, mirror-descent, and single-time-scale forms, under diverse conditions.
Findings
Convergence is established for three types of stepsizes: constant, slowly diminishing, and square-summable diminishing.
Analysis covers state-dependent, history-dependent, and combined lambda schemes.
Results include almost sure convergence and asymptotic behavior characterizations.
Abstract
We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several gradient-based TD algorithms with linear function approximation. The algorithms we analyze include: (i) two basic forms of two-time-scale gradient-based TD algorithms, which we call GTD and which minimize the mean squared projected Bellman error using stochastic gradient-descent; (ii) their "robustified" biased variants; (iii) their mirror-descent versions which combine the mirror-descent idea with TD learning; and (iv) a single-time-scale version of GTD that solves minimax problems formulated for approximate policy evaluation. We derive convergence results for three types of stepsizes: constant stepsize, slowly diminishing stepsize, as well as the standard…
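To make item (i) concrete, the following is a minimal sketch of one importance-weighted, two-time-scale gradient TD update with linear features. It follows the widely known GTD2-style form without eligibility traces (that is, lambda = 0), which is an assumption made here for illustration and not necessarily the exact form analyzed in the paper. The function name `gtd2_step`, the helper `dot`, and all parameter names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gtd2_step(
    theta: Vector,      # primary weights: value estimate is dot(theta, phi)
    w: Vector,          # auxiliary weights, updated on the faster time scale
    phi: Vector,        # features of the current state
    phi_next: Vector,   # features of the next state
    reward: float,
    rho: float,         # importance ratio: target prob. / behavior prob. of the action
    gamma: float,       # discount factor
    alpha: float,       # stepsize for theta (assumed slower)
    beta: float,        # stepsize for w (assumed faster)
) -> Tuple[Vector, Vector]:
    """One GTD2-style update (illustrative sketch, lambda = 0 assumed)."""
    # Temporal-difference error under the current value estimate.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    # Projection of the auxiliary weights onto the current features.
    phi_w = dot(phi, w)
    # Stochastic gradient step on the mean squared projected Bellman error.
    new_theta = [
        t + alpha * rho * (p - gamma * pn) * phi_w
        for t, p, pn in zip(theta, phi, phi_next)
    ]
    # The auxiliary weights track a least-squares fit of rho * delta on phi.
    new_w = [wi + beta * (rho * delta - phi_w) * p for wi, p in zip(w, phi)]
    return new_theta, new_w


# Usage: two updates on the same hypothetical transition.
theta, w = [0.0, 0.0], [0.0, 0.0]
for _ in range(2):
    theta, w = gtd2_step(theta, w, [1.0, 0.0], [0.0, 1.0],
                         reward=1.0, rho=1.0, gamma=0.9, alpha=0.1, beta=0.5)
```

On the first update only `w` changes, because the step for `theta` is scaled by `dot(phi, w)`, which starts at zero. This is one way to see why the auxiliary iterate is usually given the larger stepsize in a two-time-scale scheme.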
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems
