Deep Reinforcement Learning and the Deadly Triad
Hado van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, Joseph Modayil

TL;DR
This paper investigates how the deadly triad of function approximation, bootstrapping, and off-policy learning affects deep reinforcement learning models, especially deep Q-networks, and explores the practical implications for their stability and performance.
Contribution
The study analyzes the impact of the deadly triad in deep Q-networks with experience replay, highlighting the conditions under which divergence occurs and identifying factors influencing stability.
Findings
Deep Q-networks can diverge due to the deadly triad under certain conditions.
Experience replay can mitigate or exacerbate divergence depending on implementation.
Understanding the deadly triad helps improve the stability of deep reinforcement learning algorithms.
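For readers unfamiliar with experience replay, the following is a minimal sketch of a uniform replay buffer. It is not the implementation studied in the paper: the class name `ReplayBuffer`, its method names, and the fixed capacity are illustrative assumptions. The point it shows is that sampled transitions may come from older policies, which is why replay makes learning off-policy.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical uniform replay buffer (illustrative, not the paper's code)."""

    def __init__(self, capacity, seed=0):
        # A deque with maxlen silently drops the oldest transition when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Dummy transitions: state t, action 0, reward 1.0, next state t + 1.
    buffer.add(t, 0, 1.0, t + 1, False)

batch = buffer.sample(2)
```

With a capacity of 3, only the three most recent transitions survive, so every sampled state index is at least 2.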
Abstract
We know from reinforcement learning theory that temporal difference learning can fail in certain cases. Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three properties are combined, learning can diverge with the value estimates becoming unbounded. However, several algorithms successfully combine these three properties, which indicates that there is at least a partial gap in our understanding. In this work, we investigate the impact of the deadly triad in practice, in the context of a family of popular deep reinforcement learning models - deep Q-networks trained with experience replay - analysing how the components of this system play a role in the emergence of the deadly triad, and in the agent's performance.
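To make the divergence described in the abstract concrete, here is a minimal sketch of the classic two-state textbook illustration of the deadly triad. It is not an experiment from this paper. The function name, step size, and discount values are illustrative assumptions. Two states share one weight w, with values w and 2w (function approximation); the update uses the next state's estimate (bootstrapping); and only the transition from the first state to the second is ever updated (an off-policy update distribution).

```python
def simulate_two_state_td(gamma=0.99, alpha=0.1, steps=50, w0=1.0):
    """Semi-gradient TD(0) on the single transition s1 -> s2 with reward 0.

    Linear values with one shared weight: v(s1) = 1 * w, v(s2) = 2 * w.
    All names and default values here are illustrative assumptions.
    """
    w = w0
    history = [w]
    for _ in range(steps):
        # TD error: r + gamma * v(s2) - v(s1), with r = 0.
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        # The gradient of v(s1) with respect to w is the feature value 1.
        w += alpha * td_error * 1.0
        history.append(w)
    return history


diverging = simulate_two_state_td(gamma=0.99)  # 2 * gamma > 1: w grows
shrinking = simulate_two_state_td(gamma=0.4)   # 2 * gamma < 1: w decays
print(diverging[-1], shrinking[-1])
```

Each update multiplies w by 1 + alpha * (2 * gamma - 1), so the weight grows without bound whenever gamma exceeds 0.5 and decays toward zero otherwise.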
Taxonomy
Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research
Methods: Experience Replay
