Correcting Momentum in Temporal Difference Learning
Emmanuel Bengio, Joelle Pineau, Doina Precup

TL;DR
This paper shows that momentum in TD learning accumulates doubly stale gradients, proposes a first-order correction that improves sample efficiency in policy evaluation, and argues that deep RL is not always best served by importing techniques directly from supervised learning.
Contribution
It introduces a first-order correction term to momentum in TD learning that accounts for gradient staleness and target value drift, improving the sample efficiency of policy evaluation.
Findings
Correction improves sample efficiency in policy evaluation
Momentum in TD learning accumulates doubly stale gradients
Deep RL is not always best served by directly importing techniques from supervised learning
Abstract
A common optimization tool used in deep reinforcement learning is momentum, which consists in accumulating and discounting past gradients, reapplying them at each iteration. We argue that, unlike in supervised learning, momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale: not only does the gradient of the loss change due to parameter updates, the loss itself changes due to bootstrapping. We first show that this phenomenon exists, and then propose a first-order correction term to momentum. We show that this correction term improves sample efficiency in policy evaluation by correcting target value drift. An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.
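To make the "doubly stale" argument concrete, the following is a minimal sketch and not the paper's method. It assumes a linear value function V(s) = theta @ phi(s), semi-gradient TD(0), and heavy-ball momentum; all function names (`td_error`, `semi_gradient`, `staleness_split`, `momentum_td_step`) and the hyperparameter defaults are hypothetical and chosen for illustration. It does not implement the proposed correction term. It only shows that when the parameters move, the TD error behind a stored gradient changes through two channels: the prediction V(s) and the bootstrapped target gamma * V(s').

```python
import numpy as np

def td_error(theta, phi_s, r, phi_next, gamma):
    """TD(0) error for a linear value function V(s) = theta @ phi(s)."""
    return r + gamma * (theta @ phi_next) - theta @ phi_s

def semi_gradient(theta, phi_s, r, phi_next, gamma):
    """Semi-gradient of 0.5 * delta**2 w.r.t. theta, with the target held fixed."""
    return -td_error(theta, phi_s, r, phi_next, gamma) * phi_s

def staleness_split(theta_old, theta_new, phi_s, r, phi_next, gamma):
    """Split the change in TD error between two parameter vectors into
    a prediction part (V(s) moved) and a target part (gamma * V(s') moved).
    In supervised learning only the first part would exist."""
    d_theta = theta_new - theta_old
    prediction_part = -(d_theta @ phi_s)
    target_part = gamma * (d_theta @ phi_next)
    return prediction_part, target_part

def momentum_td_step(theta, velocity, transition, gamma=0.9, lr=0.1, beta=0.9):
    """One uncorrected heavy-ball momentum step on the TD(0) semi-gradient.
    The velocity keeps discounted gradients computed at older parameters."""
    phi_s, r, phi_next = transition
    g = semi_gradient(theta, phi_s, r, phi_next, gamma)
    velocity = beta * velocity + g
    theta = theta - lr * velocity
    return theta, velocity
```

Calling `momentum_td_step` on a transition and then passing the old and new parameter vectors to `staleness_split` for that same transition shows how much of the stored gradient's TD error has drifted because of the prediction and how much because of the target. For a linear value function the two parts sum exactly to the total change in TD error; with a nonlinear network the split would hold only to first order.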
Taxonomy
Topics: Human Pose and Action Recognition
