REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu; Jason Klein Liu; Haotian Xu; Wei Shen

arXiv:2501.03262 · cs.CL · November 11, 2025 · 2 cites

Open Access · 5 Repos · 1 Model

TL;DR

REINFORCE++ introduces global advantage normalization for critic-free reinforcement learning. The paper reports that this improves stability and reduces bias in large language model alignment, and that the method outperforms existing approaches, including PPO.

Contribution

The paper proposes REINFORCE++, a critic-free RLHF algorithm that uses global advantage normalization to address the bias and instability of prior prompt-level (local) normalization methods.

Findings

01 REINFORCE++ achieves superior stability over existing critic-free methods.

02 The global advantage normalization reduces bias and improves performance.

03 REINFORCE++ outperforms PPO in complex reasoning tasks.

Abstract

Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning Large Language Models (LLMs). The dominant algorithm, Proximal Policy Optimization (PPO), employs a critic network to estimate advantages, which introduces significant computational and memory overhead. To address this, a family of critic-free algorithms (e.g., GRPO, RLOO) has emerged. However, these methods typically rely on prompt-level (local) advantage normalization, which suffers from inaccurate advantage estimation and a tendency to overfit and, as we show, is a theoretically biased estimator. To solve these challenges, we introduce REINFORCE++, a critic-free framework centered on Global Advantage Normalization. By normalizing advantages across the entire global batch rather than small, prompt-specific groups, our method provides a more stable and theoretically sound,…
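
The contrast the abstract draws between prompt-level (local) and global advantage normalization can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: the function names `local_normalize` and `global_normalize`, the `eps` constant, the reward values, and the use of raw scalar rewards as advantages (no KL term, no token-level credit assignment) are all hypothetical choices made for illustration.

```python
import numpy as np


def local_normalize(rewards, eps=1e-8):
    """Prompt-level (local) normalization, in the style of group-based methods.

    rewards: array of shape (num_prompts, samples_per_prompt).
    Each row is standardized using only that prompt's own samples.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def global_normalize(rewards, eps=1e-8):
    """Global normalization: one mean and one std over the whole batch."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)


# Hypothetical toy batch: 2 prompts, 4 sampled responses each.
rewards = np.array([
    [1.0, 1.0, 1.0, 0.0],  # a prompt the policy mostly solves
    [0.0, 0.0, 0.0, 1.0],  # a prompt the policy mostly fails
])

local_adv = local_normalize(rewards)
global_adv = global_normalize(rewards)

print("local:\n", local_adv)
print("global:\n", global_adv)
```

In this toy batch the local variant rescales each prompt by its own small-group statistics, so the single outlier in each row receives a much larger magnitude than the other three samples. The global variant uses the batch mean (0.5) and standard deviation (0.5), so every sample ends up at approximately +1 or -1 regardless of which prompt it came from.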

Code & Models

Models

OpenRLHF/Llama-3-8b-rm-700k (model · 507 downloads · 3 likes)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: REINFORCE · Entropy Regularization · Proximal Policy Optimization