REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
Jian Hu, Jason Klein Liu, Haotian Xu, Wei Shen

TL;DR
REINFORCE++ introduces global advantage normalization for critic-free reinforcement learning. The technique improves training stability and reduces estimation bias in large language model alignment, and the paper reports that it outperforms existing methods, including PPO.
Contribution
The paper proposes REINFORCE++, a novel critic-free RLHF algorithm using global advantage normalization, addressing bias and stability issues of prior local normalization methods.
Findings
REINFORCE++ achieves superior stability over existing critic-free methods.
The global advantage normalization reduces bias and improves performance.
REINFORCE++ outperforms PPO in complex reasoning tasks.
Abstract
Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning Large Language Models (LLMs). The dominant algorithm, Proximal Policy Optimization (PPO), employs a critic network to estimate advantages, which introduces significant computational and memory overhead. To address this, a family of critic-free algorithms (e.g., GRPO, RLOO) has emerged. However, these methods typically rely on prompt-level (local) advantage normalization, which suffers from inaccurate advantage estimation, a tendency to overfit, and, as we show, is a theoretically biased estimator. To solve these challenges, we introduce REINFORCE++, a critic-free framework centered on Global Advantage Normalization. By normalizing advantages across the entire global batch rather than small, prompt-specific groups, our method provides a more stable and theoretically sound,…
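The contrast the abstract draws between prompt-level (local) and global normalization can be illustrated with a minimal sketch. The abstract only states that advantages are normalized across the entire global batch rather than within prompt-specific groups; the exact formula used here (subtract the mean, divide by the standard deviation plus a small epsilon), the function names, and the toy rewards are all assumptions for illustration and are not taken from the paper.

```python
from statistics import fmean, pstdev

EPS = 1e-8  # assumed small constant to avoid division by zero


def normalize(values):
    """Standardize a list of floats to zero mean and (near) unit variance."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / (std + EPS) for v in values]


def local_advantages(rewards_by_prompt):
    """Prompt-level (local) normalization: each prompt's group is standardized on its own."""
    return [normalize(group) for group in rewards_by_prompt]


def global_advantages(rewards_by_prompt):
    """Global normalization: one mean and one std shared by every sample in the batch."""
    flat = [r for group in rewards_by_prompt for r in group]
    normalized = iter(normalize(flat))
    # Restore the original per-prompt grouping after normalizing the flat batch.
    return [[next(normalized) for _ in group] for group in rewards_by_prompt]


# Hypothetical rewards: 2 prompts, 4 sampled responses each.
rewards = [
    [1.0, 1.0, 1.0, 0.0],  # prompt where most responses succeed
    [0.0, 0.0, 0.0, 1.0],  # prompt where most responses fail
]
local_adv = local_advantages(rewards)
global_adv = global_advantages(rewards)
```

In this toy batch, local normalization rescales each group by its own spread, so the same raw reward of 1.0 maps to roughly 0.58 in the first group and roughly 1.73 in the second. Global normalization uses a single mean and standard deviation, so a reward of 1.0 maps to roughly 1.0 in both groups.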
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: REINFORCE · Entropy Regularization · Proximal Policy Optimization
