Simplifying Deep Temporal Difference Learning

Matteo Gallici; Mattie Fellows; Benjamin Ellis; Bartomeu Pou; Ivan Masmitja; Jakob Nicolaus Foerster; Mario Martin

arXiv:2407.04811 · cs.LG · April 23, 2025 · 1 citation

TL;DR

This paper introduces a simplified, provably convergent off-policy deep Q-learning algorithm called PQN, which eliminates the need for target networks and replay buffers, achieving competitive performance and higher speed.

Contribution

It provides the first theoretical proof that regularisation like LayerNorm ensures convergence without target networks or replay buffers in off-policy TD learning.
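For orientation, the two ingredients named in that claim can be written out using their standard textbook forms. The notation below is an assumption made for this summary, not the paper's own: $x \in \mathbb{R}^k$ is a pre-activation vector, $Q_\phi$ is the Q-network with parameters $\phi$, $\alpha$ is a step size, and LayerNorm's optional learnable scale and shift are omitted.

```latex
\mathrm{LayerNorm}(x)_i = \frac{x_i - \mu(x)}{\sqrt{\sigma^2(x) + \epsilon}},
\qquad
\mu(x) = \frac{1}{k}\sum_{j=1}^{k} x_j,
\qquad
\sigma^2(x) = \frac{1}{k}\sum_{j=1}^{k} \bigl(x_j - \mu(x)\bigr)^2

\delta_t = r_t + \gamma \max_{a'} Q_\phi(s_{t+1}, a') - Q_\phi(s_t, a_t),
\qquad
\phi \leftarrow \phi + \alpha \, \delta_t \, \nabla_\phi Q_\phi(s_t, a_t)
```

The point to notice is that the bootstrap term $\max_{a'} Q_\phi(s_{t+1}, a')$ uses the same parameters $\phi$ that are being updated, whereas a target network would substitute a delayed copy of them.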

Findings

1. PQN is up to 50x faster than traditional DQN.
2. Regularisation techniques enable stable off-policy TD learning.
3. PQN performs competitively with complex algorithms like Rainbow.

Abstract

Q-learning played a foundational role in the field of reinforcement learning (RL). However, TD algorithms with off-policy data, such as Q-learning, or nonlinear function approximation like deep neural networks require several additional tricks to stabilise training, primarily a large replay buffer and target networks. Unfortunately, the delayed updating of frozen network parameters in the target network harms sample efficiency and, similarly, the large replay buffer introduces memory and implementation overheads. In this paper, we investigate whether it is possible to accelerate and simplify off-policy TD training while maintaining its stability. Our key theoretical result demonstrates for the first time that regularisation techniques such as LayerNorm can yield provably convergent TD algorithms without the need for a target network or replay buffer, even with off-policy data.…
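The idea in the abstract can be illustrated with a toy sketch. This is not the authors' PQN implementation: the chain MDP, the fixed random features, the uniform-random behaviour policy, and every name and hyperparameter below (`train`, `layer_norm`, `HIDDEN`, `gamma=0.5`, and so on) are assumptions chosen to keep the example small. It only shows the shape of the update: layer-normalised features, a bootstrap target computed from the current weights rather than a target network, and fresh transitions from parallel environment copies that are used once and discarded instead of being stored in a replay buffer.

```python
import math
import random

N_STATES, N_ACTIONS, HIDDEN = 5, 2, 16  # toy sizes, chosen for illustration


def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def step(state, action):
    """Chain MDP: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(num_envs=8, num_updates=5000, gamma=0.5, lr=0.02, seed=0):
    rng = random.Random(seed)
    # Assumption: fixed random per-state features passed through LayerNorm.
    # A real agent would train the whole network end to end.
    phi = [layer_norm([rng.gauss(0.0, 1.0) for _ in range(HIDDEN)])
           for _ in range(N_STATES)]
    w = [[0.0] * HIDDEN for _ in range(N_ACTIONS)]  # one weight vector per action
    states = [0] * num_envs  # parallel environment copies

    def q(s, a):
        return dot(w[a], phi[s])

    for _ in range(num_updates):
        grad = [[0.0] * HIDDEN for _ in range(N_ACTIONS)]
        for i, s in enumerate(states):
            a = rng.randrange(N_ACTIONS)  # uniform-random behaviour: off-policy data
            s2, r, done = step(s, a)
            # Bootstrap from the current weights: no separate target network.
            boot = 0.0 if done else max(q(s2, b) for b in range(N_ACTIONS))
            td_error = r + gamma * boot - q(s, a)
            for j in range(HIDDEN):
                grad[a][j] += td_error * phi[s][j] / num_envs
            states[i] = 0 if done else s2  # reset finished episodes
        # Apply the averaged update, then discard the batch: no replay buffer.
        for a in range(N_ACTIONS):
            for j in range(HIDDEN):
                w[a][j] += lr * grad[a][j]
    return q


q = train()
for s in range(N_STATES - 1):
    print(s, [round(q(s, a), 3) for a in range(N_ACTIONS)])
```

Each printed row lists the learned values for moving left and right in one non-terminal state. The small discount factor and the random behaviour policy are conveniences for this toy problem and say nothing about the settings used in the paper.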

Peer Reviews

Decision: ICLR 2025 Spotlight

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The work on simplifying deep reinforcement learning and removing techniques that might not be necessary, like replay buffer and target networks, is undoubtedly fundamental to deep reinforcement research. This has vast implications for rethinking the widely existing deep RL approaches and can help in other important directions, like scaling RL with the number of parameters/samples. This paper provides a unique view that challenges existing beliefs on the importance of replay buffers and target ne

Weaknesses

- The theoretical part of the manuscript is largely incoherent.
- The current manuscript scatters many things in the theory parts. It lacks a proper flow of ideas when describing the theoretical results and their implications, which makes it difficult to follow. Currently, it reads as bullet points, listing findings quickly without proper linking between subsequent findings or results.
- For example, the current theorems and lemmas are not well integrated with the text before and after them

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- This paper is well-written and easy to follow, with clear presentation of both theoretical derivations and experimental results.
- The paper's motivation is clear and compelling enough to me: it mainly provides a simplified Q-learning baseline that effectively leverages GPU parallelization and vectorized environments.
- The proposed experimental evaluation is relatively comprehensive: it covers multiple domains including proof-of-concept environments, standard single-agent benchmarks such as A

Weaknesses

There is no major weakness of this paper, but feel free to check the question section for minor questions.

Reviewer 03 · Rating 8 · Confidence 2

Strengths

The paper is well written, well formatted and quite readable. The authors present the essence of their results very well and show that stability of TD algorithms reduces to checking that a Jacobian is negative definite on the unit circle. This is a nice succinct and somehow intuitive result. Given the complexity of the proof, it was good to see the summary presented so concisely. Other results are then presented after this, including some insight into the causes of instability and then the app

Weaknesses

A criticism is that the proof of the main theoretical results is long (there are 20 pages of additional material) and I would say not particularly well organised. Before the authors give the proofs, in my view, it would be good for them to outline the main steps. I found the proofs hard to follow and as one goes through the proofs, there is a feeling of being somewhat adrift. In other words, the summary in the main paper is good; the actual proofs in the appendix are less clear.

Code & Models

Repositories

mttga/purejaxql (JAX, official)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Dense Connections · Convolution · Q-Learning · Deep Q-Network · Recurrent Replay Distributed DQN · Entropy Regularization · Proximal Policy Optimization