Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors
Juan Sebastian Rojas, Chi-Guhn Lee

TL;DR
This paper investigates two different interpretations of the temporal difference (TD) error in deep reinforcement learning, revealing that they can diverge significantly in nonlinear architectures and impact algorithm performance.
Contribution
It demonstrates that the common assumption of equivalence between TD error interpretations does not always hold in deep RL, especially with nonlinear models, affecting algorithm outcomes.
Findings
Different TD error interpretations diverge in nonlinear deep RL models
Choosing the interpretation impacts the performance of RL algorithms
Default bootstrapped target interpretation may not always be valid in deep RL
Abstract
The temporal difference (TD) error was first formalized in Sutton (1988), where it was initially characterized as the difference between temporally successive predictions, and later, in that same work, formulated as the difference between a bootstrapped target and a prediction. Since then, these two interpretations of the TD error have been used interchangeably in the literature, with the latter eventually being adopted as the standard critic loss in deep reinforcement learning (RL) architectures. In this work, we show that these two interpretations of the TD error are not always equivalent. In particular, we show that increasingly nonlinear deep RL architectures can cause these interpretations of the TD error to yield increasingly different numerical values. Then, building on this insight, we show how choosing one interpretation of the TD error over the other can affect the performance of…
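As a point of reference for the two forms named in the abstract, here is a minimal Python sketch of each, written as plain functions over scalar predictions. The function names and numeric values are hypothetical and chosen only for illustration. The sketch covers only the simple special case (no intermediate reward, no discounting) in which the two forms coincide; it does not attempt to reproduce the paper's analysis of nonlinear architectures.

```python
# Illustrative only: hypothetical scalar predictions, not the paper's setup.

def td_error_bootstrapped(reward, gamma, v_next, v_current):
    """TD error as (bootstrapped target) - (prediction)."""
    target = reward + gamma * v_next
    return target - v_current


def td_error_successive(p_next, p_current):
    """TD error as the difference between temporally successive predictions."""
    return p_next - p_current


# Hypothetical predictions for two consecutive states.
v_s, v_s_next = 0.5, 0.75

# With no intermediate reward and no discounting, the two forms coincide.
delta_target = td_error_bootstrapped(
    reward=0.0, gamma=1.0, v_next=v_s_next, v_current=v_s
)
delta_successive = td_error_successive(v_s_next, v_s)

# With a reward and a discount factor, the bootstrapped form is the one
# typically used as the basis of a critic loss.
delta_general = td_error_bootstrapped(
    reward=1.0, gamma=0.9, v_next=v_s_next, v_current=v_s
)
```

In this sketch the predictions are fixed numbers; in a deep RL critic they would be outputs of a parameterized function, which is the setting where the abstract reports that the two interpretations can yield different values.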
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Adversarial Robustness in Machine Learning
