Deep Reinforcement Learning with Gradient Eligibility Traces
Esraa Elelimy, Brett Daley, Andrew Patterson, Marlos C. Machado, Adam White, Martha White

TL;DR
This paper extends Gradient TD methods to support multistep credit assignment in deep reinforcement learning, improving stability and efficiency over existing methods, and demonstrates superior performance in MuJoCo and MinAtar environments.
Contribution
It introduces a multistep extension of the generalized PBE objective for Gradient TD methods, with new algorithms for deep RL that outperform existing approaches.
Findings
Proposed algorithms outperform PPO in MuJoCo and StreamQ in MinAtar.
Extended GTD methods achieve better stability and sample efficiency.
Demonstrated effectiveness of multistep credit assignment in deep RL environments.
Abstract
Achieving fast and stable off-policy learning in deep reinforcement learning (RL) is challenging. Most existing methods rely on semi-gradient temporal-difference (TD) methods for their simplicity and efficiency, but are consequently susceptible to divergence. While more principled approaches like Gradient TD (GTD) methods have strong convergence guarantees, they have rarely been used in deep RL. Recent work introduced the generalized Projected Bellman Error (generalized PBE), enabling GTD methods to work efficiently with nonlinear function approximation. However, this work is limited to one-step methods, which are slow at credit assignment and require a large number of samples. In this paper, we extend the generalized PBE objective to support multistep credit assignment based on the λ-return and derive three gradient-based methods that optimize this…
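For readers unfamiliar with the λ-return that the abstract builds on, the sketch below computes the standard truncated λ-return for a single trajectory segment using the usual backward recursion. It is not the paper's algorithm or code: the function name `lambda_returns` and its arguments are hypothetical, and it assumes on-policy data with no importance-sampling corrections. Setting `lam=0` recovers one-step TD targets, while `lam=1` recovers the discounted return bootstrapped from the final value estimate.

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """Truncated lambda-returns for one trajectory segment (illustrative only).

    rewards[t]     : reward received after taking the action at step t
    next_values[t] : value estimate of the state reached after step t
                     (pass 0.0 where that state is terminal)
    gamma          : discount factor in [0, 1]
    lam            : trace/mixing parameter lambda in [0, 1]
    """
    if len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must have the same length")

    returns = [0.0] * len(rewards)
    # Past the end of the segment we can only bootstrap from the last estimate.
    next_return = next_values[-1] if next_values else 0.0

    # Backward recursion:
    #   G_t = r_t + gamma * ((1 - lam) * v_next_t + lam * G_{t+1})
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * next_values[t] + lam * next_return
        )
        returns[t] = next_return
    return returns


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    next_values = [0.5, 0.5, 0.0]  # last transition ends in a terminal state
    for lam in (0.0, 0.5, 1.0):
        print(lam, lambda_returns(rewards, next_values, gamma=0.9, lam=lam))
```

Intermediate values of `lam` interpolate between the two extremes, which is the sense in which multistep targets propagate credit faster than one-step targets.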
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Experience Replay · Proximal Policy Optimization
