Scaling Life-long Off-policy Learning
Adam White, Joseph Modayil, and Richard S. Sutton

TL;DR
This paper demonstrates that life-long off-policy reinforcement learning scales: a robot learns value functions for many policies in parallel and in real time, including hundreds of accurate predictions and a run covering 1,000 policies.
Contribution
It introduces scalable off-policy learning of many policies using GTD(λ) and online MSPBE estimators, enabling real-time learning of thousands of value functions on a robot.
Findings
GTD(λ) with tile coding learns hundreds of predictions accurately.
Two online MSPBE estimators are validated for off-policy learning.
Real-time learning of 1,000 policies on a robot demonstrates scalability.
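As a rough illustration of the learning rule named in these findings, the sketch below shows one GTD(λ) update with linear function approximation, in the form commonly given in the gradient-TD literature. It is not the authors' implementation: the function name `gtd_lambda_step`, the argument names, and the use of plain Python lists are all assumptions made for this example, and the feature vectors would in practice come from a tile coder, which is not shown.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gtd_lambda_step(
    theta: Vector,     # primary weights (the value-function estimate)
    w: Vector,         # secondary weights used for the gradient correction
    e: Vector,         # eligibility trace from the previous step
    phi: Vector,       # features of the current state
    phi_next: Vector,  # features of the next state
    reward: float,
    rho: float,        # importance-sampling ratio pi(a|s) / b(a|s)
    gamma: float,      # discount (assumed constant here)
    lam: float,        # trace-decay parameter lambda
    alpha: float,      # step size for theta
    beta: float,       # step size for w
) -> Tuple[Vector, Vector, Vector]:
    """One GTD(lambda) update (hypothetical helper); returns (theta, w, e)."""
    # TD error under the current primary weights.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    # Off-policy trace: decay, add current features, reweight by rho.
    e_new = [rho * (p + gamma * lam * ei) for p, ei in zip(phi, e)]
    e_dot_w = dot(e_new, w)
    w_dot_phi = dot(w, phi)
    # Primary update: TD(lambda) term minus a gradient-correction term.
    theta_new = [
        t + alpha * (delta * ei - gamma * (1.0 - lam) * e_dot_w * pn)
        for t, ei, pn in zip(theta, e_new, phi_next)
    ]
    # Secondary update: tracks the expected TD update in feature space.
    w_new = [
        wi + beta * (delta * ei - w_dot_phi * p)
        for wi, ei, p in zip(w, e_new, phi)
    ]
    return theta_new, w_new, e_new
```

Each update costs time linear in the number of features, which is consistent with the summary's point that many such learners can run side by side in real time. With binary tile-coded features, most entries of `phi` are zero, so a sparse implementation would presumably be cheaper still.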
Abstract
We pursue a life-long learning approach to artificial intelligence that makes extensive use of reinforcement learning algorithms. We build on our prior work with general value functions (GVFs) and the Horde architecture. GVFs have been shown able to represent a wide variety of facts about the world's dynamics that may be useful to a long-lived agent (Sutton et al. 2011). We have also previously shown scaling - that thousands of on-policy GVFs can be learned accurately in real-time on a mobile robot (Modayil, White & Sutton 2011). That work was limited in that it learned about only one policy at a time, whereas the greatest potential benefits of life-long learning come from learning about many policies in parallel, as we explore in this paper. Many new challenges arise in this off-policy learning setting. To deal with convergence and efficiency challenges, we utilize the recently…
