On a convergent off-policy temporal difference learning algorithm in on-line learning environment

Prasenjit Karmakar; Rajkumar Maity; Shalabh Bhatnagar

arXiv:1605.06076·cs.LG·May 20, 2016

TL;DR

This paper rigorously analyzes the convergence of an off-policy temporal difference learning algorithm with linear function approximation in online environments, supported by empirical results on standard counterexamples.

Contribution

The paper provides a formal convergence proof, previously lacking, for the TDC algorithm with importance weighting in the online setting.

Findings

01 The TDC algorithm with importance weighting converges under certain conditions.

02 Empirical results on standard off-policy counterexamples support the theoretical convergence result.

03 The algorithm analyzed has linear per-time-step computational complexity.

Abstract

In this paper we provide a rigorous convergence analysis of an "off"-policy temporal difference learning algorithm with linear function approximation and per time-step linear computational complexity in an "online" learning environment. The algorithm considered here is TDC with importance weighting introduced by Maei et al. We support our theoretical results by providing suitable empirical results for standard off-policy counterexamples.
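To make the object of the analysis concrete, here is a minimal sketch of one common form of a single TDC update with an importance weight. It is an illustration, not the paper's exact statement of the algorithm. The function name `tdc_step`, the helper `dot`, the argument names, and in particular the placement of the importance weight `rho` in the update of the auxiliary vector `w` are assumptions of this sketch. Every operation is a dot product or an element-wise update over the feature dimension, which is the source of the linear per-time-step cost.

```python
def dot(u, v):
    """Inner product of two equal-length lists (linear in the feature dimension)."""
    return sum(a * b for a, b in zip(u, v))


def tdc_step(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One TDC update with importance weight rho (hypothetical helper).

    theta    : value-function weights
    w        : auxiliary weights of the second (faster) iterate
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance weight, target-policy prob. / behaviour-policy prob.
    alpha    : step size for theta; beta : step size for w
    """
    # TD error under the current linear value estimate.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    phi_w = dot(phi, w)

    # Main iterate: TD update plus the gradient-correction term.
    theta_new = [
        t + alpha * rho * (delta * f - gamma * fn * phi_w)
        for t, f, fn in zip(theta, phi, phi_next)
    ]
    # Auxiliary iterate; where rho enters here is an assumption of this sketch.
    w_new = [wi + beta * (rho * delta - phi_w) * f for wi, f in zip(w, phi)]
    return theta_new, w_new


# Toy usage with made-up numbers: two features, one observed transition.
theta, w = tdc_step(
    theta=[0.0, 0.0], w=[0.0, 0.0],
    phi=[1.0, 0.0], phi_next=[0.0, 1.0],
    reward=1.0, rho=2.0, gamma=0.9, alpha=0.1, beta=0.5,
)
```

In an online setting this step would be applied once per observed transition along a single trajectory generated by the behaviour policy, typically with `alpha` decaying faster than `beta` so that the two iterates evolve on separate timescales.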

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Machine Learning and ELM