Off-policy Learning with Eligibility Traces: A Survey
Matthieu Geist, Bruno Scherrer (INRIA Lorraine - LORIA)

TL;DR
This survey reviews off-policy learning algorithms with eligibility traces in Markov Decision Processes, unifying existing methods, proposing new extensions, and empirically comparing their performance on benchmark problems.
Contribution
It provides a systematic derivation of off-policy algorithms with eligibility traces, introduces new methods, and evaluates their empirical performance.
Findings
Off-policy LSTD(λ) and LSPE(λ) perform best in experiments.
TD(λ) is effective when the feature space is large.
Most algorithms show convergence properties discussed in the literature.
Abstract
In the framework of Markov Decision Processes, we consider off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We briefly review on-policy learning algorithms from the literature (gradient-based and least-squares-based), adopting a unified algorithmic view. Then, we highlight a systematic approach for adapting them to off-policy learning with eligibility traces. This leads to some known algorithms - off-policy LSTD(λ), LSPE(λ), TD(λ), TDC/GQ(λ) - and suggests new extensions - off-policy FPKF(λ), BRM(λ), gBRM(λ), GTD2(λ). We describe a comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form, discuss their known convergence properties and illustrate their relative empirical behavior on…
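To make the setting concrete, here is a minimal sketch of one common form of off-policy TD(λ) with linear function approximation, where per-step importance weights correct for the mismatch between the target policy and the behavior policy. The function name, argument layout, and default step size are assumptions made for illustration; they are not taken from the paper, and the paper's own derivation should be consulted for the exact recursions it studies.

```python
import numpy as np

def off_policy_td_lambda(phis, rewards, rhos, gamma=0.9, lam=0.5, alpha=0.1):
    """Illustrative off-policy TD(lambda) sketch (hypothetical helper).

    phis    : (n + 1, d) features of the visited states s_0 .. s_n
    rewards : (n,) rewards observed on each transition
    rhos    : (n,) importance weights pi(a_i | s_i) / mu(a_i | s_i),
              where pi is the target policy and mu the behavior policy
    Returns the weight vector theta of the linear value estimate phi(s)^T theta.
    """
    phis = np.asarray(phis, dtype=float)
    theta = np.zeros(phis.shape[1])
    z = np.zeros(phis.shape[1])   # eligibility trace
    rho_prev = 1.0                # no importance weight before the first step
    for i in range(len(rewards)):
        # Decay the trace, reweighted by the previous importance weight,
        # then add the features of the current state.
        z = gamma * lam * rho_prev * z + phis[i]
        # Temporal-difference error under the current estimate.
        delta = rewards[i] + gamma * (phis[i + 1] @ theta) - phis[i] @ theta
        # Update, weighted by the importance weight of the current transition.
        theta = theta + alpha * rhos[i] * delta * z
        rho_prev = rhos[i]
    return theta

# Tiny hand-made trajectory with two one-hot features (assumed data).
theta = off_policy_td_lambda(
    phis=[[1, 0], [0, 1], [1, 0]],
    rewards=[1.0, 0.0],
    rhos=[2.0, 0.5],
)
```

With all importance weights equal to 1, the recursion reduces to ordinary on-policy TD(λ); a weight of 0 on a transition suppresses that transition's update entirely.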
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Machine Learning and Data Classification
