Analysis of Off-Policy $n$-Step TD-Learning with Linear Function Approximation
Han-Dong Lim, Donghwan Lee

TL;DR
This paper investigates the convergence of off-policy n-step TD-learning algorithms with linear function approximation, shows that they converge once the sampling horizon n is sufficiently large, and connects model-based and model-free RL methods.
Contribution
The paper gives a theoretical analysis showing that off-policy n-step TD algorithms converge when n is sufficiently large, and it links the deterministic (model-based) algorithms to their stochastic (model-free) counterparts.
Findings
n-step TD algorithms converge with sufficiently large n
Analysis connects model-based and model-free RL methods
Provides theoretical guarantees for off-policy TD learning
Abstract
This paper analyzes multi-step temporal difference (TD)-learning algorithms within the "deadly triad" scenario, characterized by linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that $n$-step TD-learning algorithms converge to a solution as the sampling horizon $n$ increases sufficiently. The paper is divided into two parts. In the first part, we comprehensively examine the fundamental properties of their model-based deterministic counterparts, including projected value iteration and gradient descent algorithms, which can be viewed as prototype deterministic algorithms whose analysis plays a pivotal role in understanding and developing their model-free reinforcement learning counterparts. In particular, we prove that these algorithms converge to meaningful solutions when $n$ is sufficiently large. Based on these findings, in the second part,…
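The model-based counterpart mentioned in the abstract can be illustrated with a minimal sketch of n-step projected value iteration. This is not the authors' code. It assumes a known transition matrix `P` under the target policy, a reward vector `r`, a feature matrix `Phi`, and a state weighting `d` (for example, a behavior-policy state distribution). All function and variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

def n_step_operator(P, r, gamma, n):
    """Return (r_n, G_n) such that T^n v = r_n + G_n v, with G_n = (gamma P)^n."""
    G = np.eye(P.shape[0])
    r_n = np.zeros(P.shape[0])
    for _ in range(n):
        r_n = r_n + G @ r        # accumulate gamma^k P^k r
        G = gamma * (G @ P)      # advance to (gamma P)^(k+1)
    return r_n, G

def projection_coefficients(Phi, d):
    """Weighted least-squares map v -> argmin_theta ||Phi theta - v||_D."""
    D = np.diag(d)
    return np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

def projected_value_iteration(P, r, Phi, d, gamma, n, iters=500):
    """Iterate theta <- proj(T^n(Phi theta)) from theta = 0 (assumed scheme)."""
    proj = projection_coefficients(Phi, d)
    r_n, G = n_step_operator(P, r, gamma, n)
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        theta = proj @ (r_n + G @ (Phi @ theta))
    return theta

def iteration_spectral_radius(P, Phi, d, gamma, n):
    """Spectral radius of the linear part of the update; < 1 implies convergence."""
    proj = projection_coefficients(Phi, d)
    _, G = n_step_operator(P, np.zeros(P.shape[0]), gamma, n)
    return max(abs(np.linalg.eigvals(proj @ G @ Phi)))
```

In this sketch the update is affine in `theta`, so it converges whenever `iteration_spectral_radius` is below 1. Because the linear part scales with `gamma**n`, that radius shrinks as `n` grows, which matches the large-`n` convergence claim. For small `n` it can exceed 1 when `d` is not the stationary distribution of `P`, which is the off-policy difficulty the abstract refers to.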
Taxonomy
Topics: Data Stream Mining Techniques · Reinforcement Learning in Robotics
