Safe and Efficient Off-Policy Reinforcement Learning
Rémi Munos, Tom Stepleton, Anna Harutyunyan, Marc G. Bellemare

TL;DR
This paper introduces Retrace(λ), a novel off-policy reinforcement learning algorithm that is safe, low-variance, and sample-efficient, with proven convergence properties and demonstrated success on Atari games.
Contribution
It presents a new return-based off-policy RL algorithm, Retrace(λ), with strong theoretical guarantees and practical effectiveness, including convergence to optimal Q-values without the GLIE assumption.
Findings
Retrace(λ) is safe and low-variance for off-policy learning.
Convergence is proven for both Retrace(λ) and Watkins' Q(λ).
Retrace(λ) is applied successfully to Atari 2600 games.
Abstract
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(λ), which was an open…
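The three properties listed in the abstract come from how Retrace(λ) weights each step of a trajectory: the paper's trace coefficient is c_s = λ · min(1, π(a_s|x_s) / μ(a_s|x_s)), where π is the target policy and μ the behaviour policy. The following is a minimal tabular sketch of that idea, not the authors' implementation. The function names (`trace_coefficient`, `retrace_correction`) and the list-based inputs are hypothetical, and the TD errors δ_t = r_t + γ E_π[Q(x_{t+1}, ·)] − Q(x_t, a_t) are assumed to be computed by the caller.

```python
def trace_coefficient(pi_prob, mu_prob, lam=1.0):
    """Truncated importance weight c_s = lam * min(1, pi / mu).

    pi_prob: target-policy probability of the action taken.
    mu_prob: behaviour-policy probability of the same action (must be > 0).
    """
    return lam * min(1.0, pi_prob / mu_prob)


def retrace_correction(td_errors, pi_probs, mu_probs, gamma=0.99, lam=1.0):
    """Correction to Q(x_0, a_0) from one trajectory:

        sum_t gamma^t * (c_1 * ... * c_t) * delta_t

    All three lists are indexed by time step t = 0, 1, 2, ...
    (hypothetical input layout chosen for this sketch).
    """
    total = 0.0
    weight = 1.0  # gamma^t times the product of c_1..c_t; empty product at t = 0
    for t, delta in enumerate(td_errors):
        if t > 0:
            weight *= gamma * trace_coefficient(pi_probs[t], mu_probs[t], lam)
        total += weight * delta
    return total


# On-policy data (pi == mu): every c_s equals lam, so the full return is used.
on_policy = retrace_correction([1.0, 1.0, 1.0], [0.5] * 3, [0.5] * 3, gamma=0.5)

# The target policy would never take the action at t = 1: the trace is cut there.
cut = retrace_correction([1.0, 1.0, 1.0], [0.5, 0.0, 0.5], [0.5] * 3, gamma=0.5)

# A large ratio pi / mu is truncated at 1, so it cannot inflate the update.
truncated = retrace_correction([1.0, 1.0, 1.0], [0.9] * 3, [0.1] * 3, gamma=0.5)
```

In this sketch, truncating the ratio at 1 keeps the product of coefficients bounded (low variance), a zero target probability cuts the trace (safe for arbitrary behaviour policies), and near on-policy data keeps the coefficients close to λ so long returns are still used (efficiency).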
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Formal Methods in Verification
Methods: Retrace
