Revisiting a Design Choice in Gradient Temporal Difference Learning
Xiaochi Qian, Shangtong Zhang

TL;DR
This paper revisits the $A^\top$TD algorithm in gradient temporal difference learning, demonstrating its effectiveness as a simpler, single-parameter alternative to GTD for stable off-policy RL, with comparable convergence rates.
Contribution
The paper proves that a variant of $A^\top$TD is an effective, simpler alternative to GTD, requiring only one set of parameters and one learning rate.
Findings
$A_t^\top$TD is an effective solution to the deadly triad.
The variant $A_t^\top$TD has convergence rates comparable to on-policy TD.
The proposed method simplifies tuning in off-policy RL algorithms.
Abstract
Off-policy learning enables a reinforcement learning (RL) agent to reason counterfactually about policies that are not executed and is one of the most important ideas in RL. It, however, can lead to instability when combined with function approximation and bootstrapping, two arguably indispensable ingredients for large-scale reinforcement learning. This is the notorious deadly triad. The seminal work Sutton et al. (2008) pioneers Gradient Temporal Difference learning (GTD) as the first solution to the deadly triad, which has enjoyed massive success thereafter. During the derivation of GTD, some intermediate algorithm, called $A^\top$TD, was invented but soon deemed inferior. In this paper, we revisit this $A^\top$TD and prove that a variant of $A^\top$TD, called $A_t^\top$TD, is also an effective solution to the deadly triad. Furthermore, this $A_t^\top$TD only needs one set of…
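To make the idea concrete, here is a minimal sketch of what a single $A_t^\top$TD-style step could look like with linear features. It assumes the update is a stochastic gradient step on $\frac{1}{2}\lVert Aw + b\rVert^2$, with $A = \mathbb{E}[\rho\, x (\gamma x' - x)^\top]$ and $b = \mathbb{E}[\rho\, r\, x]$, where the two factors $A^\top$ and $Aw + b$ are estimated from two transitions far apart in time. The function name `at_td_step`, the tuple layout `(x, r, x_next, rho)`, and the sign conventions are illustrative assumptions, not the paper's own notation or code.

```python
import numpy as np

def at_td_step(w, early, late, alpha, gamma):
    """One hypothetical A_t^T TD-style update (illustrative sketch only).

    early, late: transitions (x, r, x_next, rho) sampled far apart in time,
    where x and x_next are feature vectors, r is the reward, and rho is the
    importance sampling ratio. The layout is an assumption for this example.
    """
    x, r, x_next, rho = early
    # Single-sample estimate of A w + b, which equals rho * (TD error) * x.
    delta = r + gamma * (x_next @ w) - (x @ w)
    residual = rho * delta * x

    # Single-sample estimate of A^T applied to the residual, using the later
    # transition: A^T = rho2 * (gamma * x2_next - x2) x2^T. Its reward is unused.
    x2, _, x2_next, rho2 = late
    direction = rho2 * (gamma * x2_next - x2) * (x2 @ residual)

    # Gradient descent step on 0.5 * ||A w + b||^2 with a single learning rate.
    return w - alpha * direction
```

Note that the step touches only one weight vector `w` and one step size `alpha`, which is the simplification over GTD that the abstract emphasizes. How far apart the two transitions must be is governed by the gap function $f(t)$ discussed in the reviews and is not modeled here.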
Peer Reviews
Decision·ICLR 2025 Poster
The problem to tackle is well stated, which is to stabilize off-policy learning and improve on the previous algorithms $A^\top$TD and GTD: the proposed algorithm saves memory compared to the $A^\top$TD algorithm and increases the convergence rate compared to GTD. Also, the paper is clearly written and easy to follow, with rigorously stated assumptions and lemmas.
The approach needs to be better motivated. GTD is known to suffer from a slow convergence rate compared to TD. More experiments comparing convergence speed, and some intuition for why the proposed algorithm can speed up learning, would be great. Also, the algorithm needs to fit better into the literature. ETD, introduced by Mahmood and colleagues (2015), is another stable off-policy algorithm. Also, a target network is suggested to help convergence (Zhang et al., 2021; Fellows et al., 2023; Che
The strengths of the paper include its originality, quality, and clarity: 1. The idea in this paper is novel to the best of my knowledge. It’s neat to use two samples distanced away from each other to estimate the terms involving two $A$ matrices, which are independent if their gap is large. The sublinear memory requirement also renders this idea a practical approach. 2. The quality of the paper is also a strength. The asymptotic convergence of the proposed algorithm is novel and may bring value
The paper has weaknesses in its significance and relevant work discussion: 1. The paper may be limited in its significance. - On the theory side, the finite time analysis is based on a variant of the proposed algorithm with a projection step, which is absent in the actual algorithm. Thus, the comparison between its convergence rate in this case with that of on-policy TD may not be very valuable. Note that finite sample analysis of the actual algorithm is possible, as also pointed out in the p
The paper expands the idea of $A^\top$TD and proposes a new method to solve the double sampling issue in off-policy learning. Compared with $A^\top$TD, this method requires less memory. The authors also provide a convergence analysis of their method.
This paper is really interesting to me. However, I have several questions. 1. The advantage of selecting $f(t)$ as an increasing function over a constant one is not immediately clear. The authors state in lines 186–187 that classical convergence analysis can be applied to establish the convergence rate. Thus, it seems that a constant $f(t)$ could also ensure convergence. Additionally, the experimental results suggest that setting $f(t)=2$ is sufficient to resolve Baird’s counterexample, which
Taxonomy
Topics: Reinforcement Learning in Robotics
