Should All Temporal Difference Learning Use Emphasis?
Xiang Gu, Sina Ghiassian, Richard S. Sutton

TL;DR
This paper empirically shows that Emphatic Temporal Difference (ETD) learning converges on several on-policy problems where conventional TD diverges or performs poorly, and that it outperforms TD on mountain car prediction. Together with similar off-policy results from prior work, this suggests ETD as a promising alternative to TD.
Contribution
The study provides empirical evidence that ETD can outperform TD under on-policy training, challenging the view that ETD is useful only for achieving convergence under off-policy training.
Findings
ETD converges on several well-known on-policy experiments where TD diverges or performs poorly.
ETD outperforms TD on the mountain car prediction task.
ETD shows potential as a general substitute for conventional TD learning.
Abstract
Emphatic Temporal Difference (ETD) learning has recently been proposed as a convergent off-policy learning method. ETD was proposed mainly to address the convergence issues of conventional Temporal Difference (TD) learning under off-policy training, but it differs from conventional TD learning even under on-policy training. A simple counterexample provided back in 2017 pointed to a potential class of problems where ETD converges but TD diverges. In this paper, we empirically show that ETD converges on a few other well-known on-policy experiments whereas TD either diverges or performs poorly. We also show that ETD outperforms TD on the mountain car prediction problem. Our results, together with a similar pattern observed under off-policy training in prior works, suggest that ETD might be a good substitute for conventional TD.
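To make concrete how ETD differs from TD even under on-policy training, here is a minimal sketch of the standard linear TD(λ) and ETD(λ) updates. It assumes linear function approximation, on-policy training (importance ratio ρ = 1), constant γ and λ, and a constant interest i(s) = 1; the function names, random features, and rewards are hypothetical, and the paper's exact experimental variant may differ. The only change in ETD is that each feature vector enters the eligibility trace scaled by an emphasis M_t, built from a follow-on trace F_t = γF_{t-1} + i(S_t).

```python
import numpy as np


def td_step(theta, e, x, x_next, reward, gamma, lam, alpha):
    """One on-policy linear TD(lambda) update with accumulating traces."""
    delta = reward + gamma * (theta @ x_next) - theta @ x
    e = gamma * lam * e + x
    return theta + alpha * delta * e, e


def etd_step(theta, e, F, x, x_next, reward, gamma, lam, alpha, interest=1.0):
    """One on-policy linear ETD(lambda) update (assumes rho = 1).

    F is the follow-on trace from the previous step; starting with F = 0
    gives F_0 = interest at the first step.
    """
    F = gamma * F + interest                  # follow-on trace F_t
    M = lam * interest + (1.0 - lam) * F      # emphasis M_t
    delta = reward + gamma * (theta @ x_next) - theta @ x
    e = gamma * lam * e + M * x               # emphasis-weighted trace
    return theta + alpha * delta * e, e, F, M


def run(lam, gamma=0.9, alpha=0.01, steps=49, seed=0):
    """Run TD and ETD side by side on one hypothetical random trajectory."""
    rng = np.random.default_rng(seed)
    xs = rng.normal(size=(steps + 1, 4))      # hypothetical features phi(S_t)
    rs = rng.normal(size=steps)               # hypothetical rewards R_{t+1}
    th_td, e_td = np.zeros(4), np.zeros(4)
    th_etd, e_etd, F, M = np.zeros(4), np.zeros(4), 0.0, 0.0
    for t in range(steps):
        th_td, e_td = td_step(th_td, e_td, xs[t], xs[t + 1], rs[t],
                              gamma, lam, alpha)
        th_etd, e_etd, F, M = etd_step(th_etd, e_etd, F, xs[t], xs[t + 1],
                                       rs[t], gamma, lam, alpha)
    return th_td, th_etd, M


if __name__ == "__main__":
    th_td, th_etd, _ = run(lam=1.0)
    print("lambda=1, TD and ETD weights equal:", np.allclose(th_td, th_etd))
    _, _, M = run(lam=0.0)
    print("lambda=0, final emphasis M:", M)
```

With λ = 1 and i ≡ 1 the emphasis is always 1, so on-policy ETD reduces exactly to TD. With λ = 0 the emphasis grows toward 1/(1 − γ) (about 10 for γ = 0.9), reweighting the updates relative to TD; this reweighting is what separates the two methods in the on-policy experiments.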
Taxonomy
Topics: Reinforcement Learning in Robotics
