Off-Policy Actor-Critic
Thomas Degris, Martha White, Richard S. Sutton

TL;DR
This paper introduces the first off-policy actor-critic algorithm. It is online and incremental, scales linearly with the number of learned weights, and combines the generality of off-policy learning with the explicit policy representation of actor-critic methods, which makes stochastic policies and large action spaces practical.
Contribution
It presents a novel incremental off-policy actor-critic algorithm whose per-time-step complexity is linear in the number of learned weights, bringing off-policy gradient temporal-difference learning into the actor-critic framework.
Findings
Achieves better or comparable performance on benchmark problems
Proves convergence under standard assumptions
Per-time-step complexity scales linearly with the number of learned weights
Abstract
This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning…
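The abstract names the ingredients (a target policy learned from behavior-policy data, an explicitly represented stochastic policy, and linear per-step cost) but not the update rules. The following is only a rough sketch of how those ingredients might fit together, not the paper's algorithm as published. It assumes linear state features, a softmax target policy, a one-step gradient-TD style critic with no eligibility traces, and an importance-sampling ratio rho = pi(a|s) / b(a|s). All function names, parameter names, and step sizes are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def softmax_probs(u, x):
    """Action probabilities of a softmax policy with linear preferences u[k] . x."""
    prefs = [dot(u_k, x) for u_k in u]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def off_policy_ac_step(u, v, w, x, a, r, x_next, b_prob,
                       gamma=0.9, alpha_u=0.01, alpha_v=0.05, alpha_w=0.01):
    """Update actor weights u and critic weights v, w in place.

    (x, a, r, x_next) is one transition generated by the behavior policy,
    and b_prob is the probability the behavior policy gave to action a.
    Every loop below is linear in the number of weights it touches.
    Returns (rho, delta) for inspection.
    """
    pi = softmax_probs(u, x)
    rho = pi[a] / b_prob                              # importance-sampling ratio
    delta = r + gamma * dot(v, x_next) - dot(v, x)    # TD error of the critic
    wx = dot(w, x)                                    # auxiliary estimate used by gradient-TD

    # Critic: one-step gradient-TD style update with a correction term (assumed form).
    for i in range(len(v)):
        v[i] += alpha_v * rho * (delta * x[i] - gamma * wx * x_next[i])
        w[i] += alpha_w * (rho * delta * x[i] - wx * x[i])

    # Actor: importance-weighted policy-gradient step.
    # For a softmax policy, d log pi(a|x) / d u[k] = (1[k == a] - pi[k]) * x.
    for k in range(len(u)):
        coeff = alpha_u * rho * delta * ((1.0 if k == a else 0.0) - pi[k])
        for i in range(len(x)):
            u[k][i] += coeff * x[i]

    return rho, delta
```

In this sketch the behavior policy only enters through b_prob, so the same update can be fed transitions from any data-collecting policy that assigns nonzero probability to the actions taken. Adding eligibility traces would change both the critic and actor updates and is left out here.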
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Optimization and Search Problems
