Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning
Kristopher De Asis, Alan Chan, Silviu Pitis, Richard S. Sutton, Daniel Graves

TL;DR
This paper introduces fixed-horizon temporal difference methods for reinforcement learning, which predict rewards over a set number of steps and avoid stability issues of traditional off-policy TD methods, with proven convergence.
Contribution
The paper presents a fixed-horizon TD approach in which no value function bootstraps from itself, which makes it immune to the stability problems of the deadly triad. It supports the approach with empirical comparisons and convergence proofs.
Findings
Fixed-horizon methods are stable and avoid the deadly triad.
They perform competitively with Q-learning on reinforcement learning problems.
Convergence is proven for linear and general function approximation.
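The stability claim follows from the structure of the bootstrapping target, which can be written out as a short sketch. The notation here is an assumption, since this page does not show the paper's symbols: \hat{v}^h is the estimate for horizon h, \alpha is a step size, and \gamma is a discount factor.

```latex
% Fixed-horizon value: expected discounted sum of the next h rewards.
v_\pi^h(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{h-1} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right],
\qquad v_\pi^0(s) = 0 .

% One-step update: the target for horizon h uses only the horizon h-1 estimate.
\hat{v}^h(S_t) \leftarrow \hat{v}^h(S_t)
  + \alpha \left[ R_{t+1} + \gamma\, \hat{v}^{h-1}(S_{t+1}) - \hat{v}^h(S_t) \right]
```

Because the target for horizon h never contains \hat{v}^h itself, the dependencies form a chain that ends at the fixed value \hat{v}^0 = 0 rather than a loop.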
Abstract
We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h-1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as "the deadly triad"). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement…
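The bootstrapping scheme described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the function name `fixed_horizon_td`, the `(state, reward, next_state, done)` transition format, the step size, and the three-state chain are all assumptions made for illustration.

```python
def fixed_horizon_td(transitions, n_states, max_horizon,
                     alpha=0.5, gamma=1.0, sweeps=200):
    """Tabular one-step fixed-horizon TD prediction (illustrative sketch).

    transitions: list of (state, reward, next_state, done) tuples.
    Returns V, where V[h][s] estimates the discounted sum of the
    next h rewards starting from state s.
    """
    # V[0] is fixed at zero and is never updated.
    V = [[0.0] * n_states for _ in range(max_horizon + 1)]
    for _ in range(sweeps):
        for s, r, s_next, done in transitions:
            # Build every target from the current estimates before writing
            # any update. The horizon-h target uses only the horizon h-1 table.
            targets = [
                r + (0.0 if done else gamma * V[h - 1][s_next])
                for h in range(1, max_horizon + 1)
            ]
            for h, target in enumerate(targets, start=1):
                V[h][s] += alpha * (target - V[h][s])
    return V


# Hypothetical deterministic chain 0 -> 1 -> 2 -> terminal, reward 1 per step.
transitions = [(0, 1.0, 1, False), (1, 1.0, 2, False), (2, 1.0, 2, True)]
V = fixed_horizon_td(transitions, n_states=3, max_horizon=4)
for h, row in enumerate(V):
    print(h, [round(v, 3) for v in row])
```

On this chain with no discounting, the horizon-h value of state s should approach min(h, 3 - s), since at most 3 - s rewards remain before termination. All horizons are updated from the same transition, which is one way to read the "parallel updates" mentioned in the abstract.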
Taxonomy
Methods: Q-Learning
