Gradient Descent Temporal Difference-difference Learning
Rong J.B. Zhu, James M. Murray

TL;DR
This paper introduces Gradient-DD, an off-policy reinforcement learning algorithm that adds second-order differences in successive parameter updates to GTD2. It converges faster than GTD2 and, in some cases, performs better than conventional TD learning.
Contribution
The paper proposes Gradient-DD, a new algorithm that extends GTD2 with second-order differences in successive parameter updates. It provides a theoretical proof of convergence and validates the algorithm empirically.
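This page does not state the update rule, so the following is only a minimal sketch. It assumes linear values V(s) = θᵀφ(s), an importance-sampling ratio `rho`, and a hypothetical coefficient `kappa`. The sketch starts from the GTD2 update and adds one illustrative second-order-difference term, which damps the change in the current state's value between the previous and current parameter vectors (θ_t − θ_{t−1}). It is not the paper's exact Gradient-DD update, and the form and weighting of that term are assumptions.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gradient_dd_step(
    theta: Vector,
    theta_prev: Vector,
    w: Vector,
    phi: Vector,
    phi_next: Vector,
    reward: float,
    rho: float,
    gamma: float,
    alpha: float,
    beta: float,
    kappa: float,
) -> Tuple[Vector, Vector]:
    """One GTD2-style step plus an ASSUMED second-order-difference term.

    kappa is a hypothetical name for the weight on the extra term; with
    kappa = 0.0 the step reduces to the standard GTD2 update.
    """
    # TD error under the current parameters theta_t.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    # GTD2 uses phi^T w as its estimate of the expected TD error.
    phi_w = dot(phi, w)
    # Assumed extra signal: change in V(s) between theta_{t-1} and theta_t.
    dv = dot(phi, [t - tp for t, tp in zip(theta, theta_prev)])
    # GTD2 direction (phi - gamma * phi_next) * (phi^T w), minus the assumed damping term.
    new_theta = [
        t + alpha * rho * ((f - gamma * fn) * phi_w - kappa * f * dv)
        for t, f, fn in zip(theta, phi, phi_next)
    ]
    # Auxiliary weights w track the TD error given the features, as in GTD2.
    new_w = [wi + beta * rho * (delta - phi_w) * f for wi, f in zip(w, phi)]
    return new_theta, new_w
```

The caller has to keep the previous θ between steps because the extra term needs θ_t − θ_{t−1}. When θ has not changed since the last step, the term vanishes and the step is identical to GTD2.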
Findings
Gradient-DD converges faster than GTD2.
Gradient-DD outperforms GTD2 in various tasks.
In some cases, Gradient-DD surpasses conventional TD learning.
Abstract
Off-policy algorithms, in which a behavior policy differs from the target policy and is used to gain experience for learning, have proven to be of great practical value in reinforcement learning. However, even for simple convex problems such as linear value function approximation, these algorithms are not guaranteed to be stable. To address this, alternative algorithms that are provably convergent in such cases have been introduced, the most well known being gradient descent temporal difference (GTD) learning. This algorithm and others like it, however, tend to converge much more slowly than conventional temporal difference learning. In this paper we propose gradient descent temporal difference-difference (Gradient-DD) learning in order to improve GTD2, a GTD algorithm, by introducing second-order differences in successive parameter updates. We investigate this algorithm in the…
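For reference, the GTD2 update mentioned in the abstract is commonly written as follows. The formulation uses linear values V_θ(s) = θᵀφ(s), auxiliary weights w, step sizes α_t and β_t, and the importance-sampling ratio ρ_t = π(a_t | s_t) / μ(a_t | s_t) between the target policy π and the behavior policy μ. This notation is assumed here and is not taken from this page.

```latex
\begin{aligned}
\delta_t &= r_{t+1} + \gamma\,\theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t,\\
w_{t+1} &= w_t + \beta_t\,\rho_t\,\bigl(\delta_t - \phi_t^\top w_t\bigr)\,\phi_t,\\
\theta_{t+1} &= \theta_t + \alpha_t\,\rho_t\,\bigl(\phi_t - \gamma\,\phi_{t+1}\bigr)\bigl(\phi_t^\top w_t\bigr).
\end{aligned}
```

Conventional off-policy linear TD(0) would instead use θ_{t+1} = θ_t + α_t ρ_t δ_t φ_t. That rule needs no auxiliary weights, but as the abstract notes, it is not guaranteed to be stable off-policy.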
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research
