Convergence of off-policy TD(0) with linear function approximation for reversible Markov chains
Maik Overmars, Jasper Goseling, Richard Boucherie

TL;DR
This paper proves that standard off-policy TD(0) with linear function approximation converges for reversible Markov chains, gives an explicit upper bound on the discount factor under which convergence holds, and obtains these results with a modified stochastic approximation framework.
Contribution
It establishes convergence guarantees for the standard, unmodified off-policy TD(0) algorithm under a reversibility assumption, improving on existing results by giving an explicit bound on the discount factor.
Findings
Convergence with probability one and zero projected Bellman error.
Explicit upper bound on discount factor for convergence.
Application to reversible Markov chains like random walks.
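As a concrete illustration of the reversibility condition mentioned above, the following is a minimal sketch, not taken from the paper. It uses the standard detailed-balance definition: a chain with transition matrix P and stationary distribution pi is reversible when pi[i] * P[i][j] == pi[j] * P[j][i] for all i, j. The helper names (random_walk_matrix, stationary_from_weights, is_reversible) and the example weights are hypothetical and chosen only for illustration.

```python
def random_walk_matrix(weights):
    """Transition matrix of a random walk on a weighted undirected graph.

    weights[i][j] == weights[j][i] >= 0 is the weight of edge {i, j};
    the walk moves from i to j with probability weights[i][j] / sum(weights[i]).
    """
    n = len(weights)
    return [[weights[i][j] / sum(weights[i]) for j in range(n)] for i in range(n)]


def stationary_from_weights(weights):
    """Stationary distribution of that walk: proportional to each node's total weight."""
    totals = [sum(row) for row in weights]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def is_reversible(P, pi, tol=1e-12):
    """Check detailed balance pi[i] * P[i][j] == pi[j] * P[j][i] up to tol."""
    n = len(P)
    return all(
        abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
        for i in range(n)
        for j in range(n)
    )


# Illustrative symmetric edge weights on a 3-node graph (an assumption, not from the paper).
weights = [
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 3.0],
    [1.0, 3.0, 0.0],
]
P = random_walk_matrix(weights)
pi = stationary_from_weights(weights)
reversible = is_reversible(P, pi)
```

A random walk on a weighted undirected graph satisfies detailed balance by construction, whereas a deterministic cycle such as 0 -> 1 -> 2 -> 0 with uniform pi does not.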
Abstract
We study the convergence of off-policy TD(0) with linear function approximation when used to approximate the expected discounted reward in a Markov chain. It is well known that the combination of off-policy learning and function approximation can lead to divergence of the algorithm. Existing results for this setting modify the algorithm, for instance by reweighing the updates using importance sampling. This establishes convergence at the expense of additional complexity. In contrast, our approach is to analyse the standard algorithm, but to restrict our attention to the class of reversible Markov chains. We demonstrate convergence under this mild reversibility condition on the structure of the chain, which in many applications can be assumed using domain knowledge. In particular, we establish a convergence guarantee under an upper bound on the discount factor in terms of the difference…
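To make the setting in the abstract concrete, here is a minimal sketch of TD(0) with linear function approximation and no importance-sampling correction. It assumes one common off-policy formulation: the current state is drawn i.i.d. from a behaviour distribution mu that differs from the stationary distribution of the target chain, and the next state is drawn from the target chain P. The paper's exact sampling scheme, step sizes, and bound are not stated in this summary, so the function name td0_off_policy, its parameters, the step-size schedule, and the example chain are all assumptions made for illustration.

```python
import random


def td0_off_policy(P, rewards, features, mu, gamma, steps, seed=0):
    """Sketch of off-policy TD(0) with linear features (assumed formulation).

    P        : target-chain transition matrix (list of rows)
    rewards  : reward received in each state
    features : features[s] is the feature vector phi(s)
    mu       : behaviour distribution used to sample the current state
    gamma    : discount factor in [0, 1)
    """
    rng = random.Random(seed)
    states = list(range(len(P)))
    theta = [0.0] * len(features[0])
    for t in range(steps):
        alpha = 1.0 / (t + 10)  # diminishing step size (an illustrative choice)
        s = rng.choices(states, weights=mu)[0]         # off-policy state sample
        s_next = rng.choices(states, weights=P[s])[0]  # transition of the target chain
        v = sum(f * th for f, th in zip(features[s], theta))
        v_next = sum(f * th for f, th in zip(features[s_next], theta))
        delta = rewards[s] + gamma * v_next - v        # temporal-difference error
        # Standard TD(0) update: no importance weights are applied.
        theta = [th + alpha * delta * f for th, f in zip(theta, features[s])]
    return theta


# Reflecting random walk on 4 states; P is symmetric, so the chain is
# reversible with a uniform stationary distribution.
P = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
rewards = [0.0, 0.0, 0.0, 1.0]
features = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # bias + position
mu = [0.4, 0.3, 0.2, 0.1]  # deliberately not the uniform stationary distribution

theta = td0_off_policy(P, rewards, features, mu, gamma=0.5, steps=20000)
```

The returned theta defines the value estimate sum(phi(s)[k] * theta[k]) for each state s. Whether the iterates converge depends on the discount factor and on how far mu is from the stationary distribution, which is the relationship the paper quantifies.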
