Off-Policy Actor-Critic with Emphatic Weightings
Eric Graves, Ehsan Imani, Raksha Kumaraswamy, Martha White

TL;DR
This paper introduces ACE, an off-policy actor-critic algorithm that uses emphatic weightings to follow the gradient of a unified off-policy objective. It addresses a failure of earlier semi-gradient methods such as OffPAC and DPG, which can converge to the wrong solution.
Contribution
The paper unifies existing off-policy objectives into a single objective, derives a policy gradient theorem for it using emphatic weightings and interest functions, and proposes the ACE algorithm with multiple strategies for approximating the gradient.
Findings
ACE outperforms previous methods like OffPAC in experiments.
Direct approximation of emphatic weightings improves stability and performance.
ACE converges to the optimal solution in tested environments.
Abstract
A variety of theoretically-sound policy gradient algorithms exist for the on-policy setting due to the policy gradient theorem, which provides a simplified form for the gradient. The off-policy setting, however, has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients, in an algorithm called Actor Critic with Emphatic weightings (ACE). We prove in a counterexample that previous (semi-gradient) off-policy actor-critic methods--particularly Off-Policy Actor-Critic (OffPAC) and Deterministic Policy Gradient (DPG)--converge to the wrong solution whereas ACE finds…
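The abstract names emphatic weightings and interest functions without showing how they are computed. As a rough illustration only, the sketch below uses the followon-trace recursion familiar from emphatic temporal-difference learning, F_t = gamma * rho_{t-1} * F_{t-1} + i(S_t), mixed with the interest as M_t = (1 - lambda_a) * i(S_t) + lambda_a * F_t. This exact form, the function name `emphatic_weightings`, and the parameter name `lambda_a` are assumptions for illustration and are not taken from the text above.

```python
def emphatic_weightings(rhos, interests, gamma, lambda_a):
    """Return the weighting M_t for every step of one trajectory.

    Assumed inputs (hypothetical names, not from the summary above):
      rhos[t]      importance ratio pi(A_t | S_t) / mu(A_t | S_t)
      interests[t] interest i(S_t) >= 0 in the state visited at step t
      gamma        discount factor in [0, 1]
      lambda_a     mixing parameter in [0, 1]
    """
    weightings = []
    followon = 0.0   # F_{-1} = 0, so F_0 reduces to i(S_0)
    prev_rho = 1.0   # placeholder; multiplied by followon = 0 at t = 0
    for rho, interest in zip(rhos, interests):
        # Followon trace: discounted, importance-corrected interest so far.
        followon = gamma * prev_rho * followon + interest
        # Mix plain interest with the followon trace.
        weightings.append((1.0 - lambda_a) * interest + lambda_a * followon)
        prev_rho = rho
    return weightings


if __name__ == "__main__":
    rhos = [2.0, 0.5, 1.0]
    interests = [1.0, 1.0, 1.0]
    # lambda_a = 0 ignores the trace; lambda_a = 1 uses it fully.
    print(emphatic_weightings(rhos, interests, gamma=0.5, lambda_a=0.0))
    print(emphatic_weightings(rhos, interests, gamma=0.5, lambda_a=1.0))
```

Under these assumptions, setting `lambda_a` to 0 weights each update by the interest alone, which corresponds to the semi-gradient style of update, while setting it to 1 applies the full followon trace. In an actor update, each M_t would scale the importance-weighted policy-gradient term for step t.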
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
