Actor-Critic Reinforcement Learning with Phased Actor
Ruofan Wu, Junmin Zhong, Jennie Si

TL;DR
This paper introduces PAAC, a novel actor-critic reinforcement learning method that enhances policy gradient estimation, leading to improved control policies with higher robustness, faster learning, and better performance in continuous control tasks.
Contribution
The paper proposes PAAC, a phased actor-critic approach that improves policy gradient estimation, proves convergence and stability, and demonstrates superior performance over existing methods.
Findings
PAAC reduces variance in policy gradient estimates.
PAAC improves learning speed and robustness.
PAAC outperforms baseline algorithms in control tasks.
Abstract
Policy gradient methods in actor-critic reinforcement learning (RL) have become perhaps the most promising approaches to solving continuous optimal control problems. However, the trial-and-error nature of RL and the inherent randomness associated with solution approximations cause variations in the learned optimal values and policies. This has significantly hindered their successful deployment in real-life applications where control responses need to meet dynamic performance criteria deterministically. Here we propose a novel phased actor in actor-critic (PAAC) method, which aims to improve policy gradient estimation and thus the quality of the control policy. Specifically, PAAC accounts for both value and TD error in its actor update. We prove qualitative properties of PAAC for learning convergence of the value and policy, solution optimality, and stability of system dynamics.…
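The abstract says the actor update uses both the value and the TD error, but the excerpt here does not give the update rule. The sketch below only illustrates the two quantities involved: the standard one-step TD error, and a hypothetical blend of a bootstrapped value target with that TD error. The function names, the `phase_weight` parameter, and the linear blending rule are assumptions made for illustration and should not be read as the PAAC update itself.

```python
def td_error(reward, value_s, value_next, gamma=0.99, done=False):
    """Standard one-step TD error: r + gamma * V(s') - V(s)."""
    # No bootstrapping from the next state once the episode has ended.
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


def actor_signal(reward, value_s, value_next, phase_weight, gamma=0.99, done=False):
    """Hypothetical actor learning signal mixing value and TD error.

    phase_weight = 0.0 uses only the bootstrapped value target,
    phase_weight = 1.0 uses only the TD error.
    ASSUMPTION: this linear blend is illustrative, not the paper's rule.
    """
    if not 0.0 <= phase_weight <= 1.0:
        raise ValueError("phase_weight must lie in [0, 1]")
    bootstrap = 0.0 if done else gamma * value_next
    value_target = reward + bootstrap          # value-based signal
    delta = value_target - value_s             # TD-error signal
    return (1.0 - phase_weight) * value_target + phase_weight * delta
```

Under this assumed blend, moving `phase_weight` from 0 to 1 gradually subtracts the critic's estimate `V(s)` as a baseline, which is one common way to trade bias against variance in a policy gradient signal.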
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
