Value Improved Actor Critic Algorithms
Yaniv Oren, Moritz A. Zanger, Pascal R. van der Vaart, Mustafa Mert Celikok, Matthijs T. J. Spaan, Wendelin Bohmer

TL;DR
This paper decouples the acting policy from the policy evaluated by the critic in Actor Critic algorithms. The critic's policy can then be improved with greedier updates while the acting policy keeps its slow, stable gradient-based improvement, which improves performance.
Contribution
It proposes decoupling the acting policy from the policy evaluated by the critic. This enables greedier updates to the critic's policy (value improvement) while preserving the stability of slow, gradient-based updates to the acting policy.
Findings
Incorporating value-improvement improves TD3 and SAC performance.
The approach achieves better or comparable results across various environments.
Minimal additional computational cost is required.
Abstract
To learn approximately optimal acting policies for decision problems, modern Actor Critic algorithms rely on deep Neural Networks (DNNs) to parameterize the acting policy and greedification operators to iteratively improve it. The reliance on DNNs suggests an improvement that is gradient based, which is per step much less greedy than the improvement possible by greedier operators such as the greedy update used by Q-learning algorithms. On the other hand, slow changes to the policy can also be beneficial for the stability of the learning process, resulting in a tradeoff between greedification and stability. To better address this tradeoff, we propose to decouple the acting policy from the policy evaluated by the critic. This allows the agent to separately improve the critic's policy (e.g. value improvement) with greedier updates while maintaining the slow gradient-based improvement to…
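The decoupling described in the abstract can be illustrated with a small tabular sketch. It compares the bootstrap target a critic would use when it evaluates the acting policy with the target it would use when it evaluates a greedier policy. Everything here is an assumption made for illustration: the three-action setup, the numbers, the helper names (`expected_value`, `greedify`, `td_target`), and in particular the exponential reweighting used as the greedification operator, which is a stand-in and not necessarily the operator used in the paper.

```python
import math


def expected_value(probs, q_values):
    """Expected Q-value of a discrete policy: sum_a pi(a) * Q(s', a)."""
    return sum(p * q for p, q in zip(probs, q_values))


def greedify(probs, q_values, temperature):
    """Hypothetical greedification operator (illustrative assumption).

    Reweights the policy by exp(Q / temperature) and renormalizes.
    A smaller temperature gives a greedier policy; as the temperature
    approaches zero the result approaches the greedy (argmax) policy.
    """
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [p * math.exp((q - q_max) / temperature)
               for p, q in zip(probs, q_values)]
    total = sum(weights)
    return [w / total for w in weights]


def td_target(reward, gamma, probs, q_next):
    """One-step bootstrap target for the policy given by `probs`."""
    return reward + gamma * expected_value(probs, q_next)


# Assumed toy values: three actions in the next state.
acting_policy = [0.5, 0.3, 0.2]   # slowly changing acting policy
q_next = [1.0, 2.0, 0.0]          # critic's estimates Q(s', a)
reward, gamma = 0.0, 0.9

# Standard actor-critic: the critic evaluates the acting policy.
standard_target = td_target(reward, gamma, acting_policy, q_next)

# Decoupled variant: the critic evaluates a greedier policy,
# while the acting policy itself is left unchanged.
evaluated_policy = greedify(acting_policy, q_next, temperature=0.5)
improved_target = td_target(reward, gamma, evaluated_policy, q_next)

# Fully greedy target, as in Q-learning, for comparison.
greedy_target = reward + gamma * max(q_next)
```

In this sketch the value-improved target lies between the standard target and the Q-learning target, and the temperature controls where it falls in that range. The acting policy is never modified by `greedify`, which mirrors the idea of keeping its improvement slow while the critic's policy is improved more greedily.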
Taxonomy
Topics: Complex Systems and Decision Making · Simulation Techniques and Applications
Methods: Batch Normalization · Weight Decay · Target Policy Smoothing · Adam · Dense Connections · Convolution · Clipped Double Q-learning · Twin Delayed Deep Deterministic · Experience Replay
