Deep Reinforcement Learning and the Deadly Triad

Hado van Hasselt; Yotam Doron; Florian Strub; Matteo Hessel; Nicolas Sonnerat; Joseph Modayil

arXiv:1812.02648 · cs.AI · December 7, 2018 · 110 cites

Open Access

TL;DR

This paper investigates how the deadly triad of function approximation, bootstrapping, and off-policy learning affects deep reinforcement learning models, especially deep Q-networks, and explores the practical implications for their stability and performance.

Contribution

The study analyzes the impact of the deadly triad in deep Q-networks with experience replay, highlighting the conditions under which divergence occurs and identifying factors influencing stability.

Findings

01  Deep Q-networks can diverge due to the deadly triad under certain conditions.

02  Experience replay can mitigate or exacerbate divergence depending on implementation.

03  Understanding the deadly triad helps improve the stability of deep reinforcement learning algorithms.
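Since the findings turn on how experience replay is implemented, it may help to see what the mechanism looks like in its simplest form. The sketch below is an assumption for illustration, not the implementation studied in the paper: a hypothetical `ReplayBuffer` class with a fixed capacity, first-in-first-out eviction, and uniform sampling. Because sampled transitions were generated by earlier versions of the agent's policy, learning from them is off-policy, which is one leg of the triad.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical minimal uniform replay buffer (illustration only)."""

    def __init__(self, capacity, seed=None):
        # deque with maxlen drops the oldest transition once capacity is reached.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        # Store one transition as a tuple.
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(2)
print(len(buffer), len(batch))
```

Capacity and the sampling rule are the kinds of implementation choices the finding refers to; a prioritized sampler, for example, would change which transitions are replayed and how far the update distribution drifts from the current policy.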

Abstract

We know from reinforcement learning theory that temporal difference learning can fail in certain cases. Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three properties are combined, learning can diverge with the value estimates becoming unbounded. However, several algorithms successfully combine these three properties, which indicates that there is at least a partial gap in our understanding. In this work, we investigate the impact of the deadly triad in practice, in the context of a family of popular deep reinforcement learning models - deep Q-networks trained with experience replay - analysing how the components of this system play a role in the emergence of the deadly triad, and in the agent's performance.
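The divergence described in the abstract can be reproduced in a few lines. The sketch below is not drawn from the paper's experiments; it assumes the standard textbook two-state setup, and the helper name `run_td` and all parameter values are hypothetical. Two states share one weight `w` (function approximation: state A has feature 1, state B has feature 2), the target for A uses the current estimate of B (bootstrapping), and only the transition from A to B is ever updated, regardless of what the behaviour would visit next (off-policy). The reward is always 0, so the true values are 0.

```python
def run_td(gamma, alpha=0.1, steps=100, w=1.0):
    """Apply semi-gradient TD(0) repeatedly to the single transition A -> B.

    v(A) = 1 * w and v(B) = 2 * w share the weight w; the reward is 0.
    """
    for _ in range(steps):
        # TD error: r + gamma * v(B) - v(A)
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        # The gradient of v(A) with respect to w is A's feature, which is 1.
        w += alpha * td_error * 1.0
    return w


w_high_discount = run_td(gamma=0.9)
w_low_discount = run_td(gamma=0.4)
print(w_high_discount, w_low_discount)
```

Each update multiplies `w` by `1 + alpha * (2 * gamma - 1)`, so under these assumptions the weight grows without bound whenever `gamma > 0.5` and shrinks toward the true value of 0 when `gamma < 0.5`.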

Taxonomy

Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research

Methods: Experience Replay