Improving Value Estimation Critically Enhances Vanilla Policy Gradient

Tao Wang; Ruipeng Zhang; Sicun Gao

arXiv:2505.19247 · cs.LG · May 27, 2025

TL;DR

Enhancing value estimation accuracy by increasing value update steps significantly improves vanilla policy gradient performance, making it comparable to advanced algorithms like PPO and more robust to hyperparameters.

Contribution

The paper reveals that improving value estimation, rather than trust region enforcement, is key to enhancing vanilla policy gradient algorithms.

Findings

01. Increasing the number of value update steps per iteration lets vanilla policy gradient match or surpass PPO.

02. With this change, vanilla policy gradient becomes more robust to hyperparameter choices.

03. This simple modification suggests that RL algorithms can become more effective and easier to use.

Abstract

Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.
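
The central knob described in the abstract, the number of value update steps per iteration, can be illustrated with a toy sketch. This is not the authors' implementation: the linear value function, the synthetic returns, the learning rate, and the names `fit_value`, `value_mse`, and `num_value_steps` are all assumptions made for illustration. It only shows that taking more gradient steps on the value loss for the same batch yields a more accurate value estimate, which is what a vanilla policy gradient update would then use as its baseline.

```python
import random


def fit_value(states, returns, num_value_steps, lr=0.1):
    """Fit a hypothetical linear value function V(s) = w * s + b.

    Runs `num_value_steps` full-batch gradient steps on the mean squared
    error between V(s) and the observed returns.
    """
    w, b = 0.0, 0.0
    n = len(states)
    for _ in range(num_value_steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * s + b - g) * s for s, g in zip(states, returns)) / n
        grad_b = sum(2 * (w * s + b - g) for s, g in zip(states, returns)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def value_mse(w, b, states, returns):
    """Mean squared error of the fitted value function on the batch."""
    return sum((w * s + b - g) ** 2 for s, g in zip(states, returns)) / len(states)


# Synthetic batch (assumption): scalar states with noisy linear returns.
rng = random.Random(0)
states = [rng.uniform(-1.0, 1.0) for _ in range(256)]
returns = [3.0 * s + 1.0 + rng.gauss(0.0, 0.1) for s in states]

# Same batch, same learning rate; only the number of value steps differs.
mse_few = value_mse(*fit_value(states, returns, num_value_steps=1), states, returns)
mse_many = value_mse(*fit_value(states, returns, num_value_steps=50), states, returns)

print(f"value MSE after 1 step:   {mse_few:.4f}")
print(f"value MSE after 50 steps: {mse_many:.4f}")
```

In a full training loop, the advantage used in the policy gradient would be computed from this fitted value function, so a poorly fitted value function would feed noisier advantages into every policy update.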

Code & Models

Repositories

taowang0/value-estimation-vpg
PyTorch · Official

Taxonomy

Topics: Biochemical and biochemical processes

Methods: Entropy Regularization · Trust Region Policy Optimization · Proximal Policy Optimization