CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models

Zongkai Liu; Fanqing Meng; Lingxiao Du; Zhixiang Zhou; Chao Yu; Wenqi Shao; Qiaosheng Zhang

arXiv:2505.12504·cs.LG·May 20, 2025

TL;DR

This paper introduces CPGD, an algorithm for rule-based reinforcement learning in language models that improves training stability and performance by regularizing policy updates with a KL-divergence-based policy drift constraint and by clipping the logarithm of the policy ratio.

Contribution

The paper proposes CPGD, a novel stabilization method for RL in language models, combining KL divergence-based regularization and clipping to prevent training collapse.

Findings

01 CPGD reduces training instability in RL for language models.
02 Empirical results show improved performance over existing methods.
03 Theoretical analysis supports the stability benefits of CPGD.

Abstract

Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) with rule-based rewards. However, existing RL methods -- such as GRPO, REINFORCE++, and RLOO -- often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clip mechanism on the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that…
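
The two mechanisms named in the abstract, a clip on the logarithm of the ratio and a KL-based policy drift term, can be illustrated with a small per-token objective. This is only a sketch under stated assumptions, not the paper's exact formulation: the function name `cpgd_style_objective`, the parameters `eps` and `alpha`, the PPO-style `min` pairing, and the particular non-negative KL estimator `(r - 1) - log r` are all choices made here for illustration.

```python
import math

def cpgd_style_objective(logp_new, logp_old, advantages, eps=0.2, alpha=0.1):
    """Mean per-token objective (to be maximized) for a CPGD-style update.

    Assumptions (not taken from the paper): eps is the clip width, alpha
    weights the drift penalty, and the drift is estimated per token as
    (r - 1) - log r with r = p_new / p_old, which is always >= 0.
    """
    low, high = math.log(1.0 - eps), math.log(1.0 + eps)
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        log_ratio = lp_new - lp_old
        # Clip the log-ratio itself rather than the ratio.
        clipped = min(max(log_ratio, low), high)
        # Pessimistic choice between unclipped and clipped terms, so the
        # objective stops rewarding movement past the clip range.
        surrogate = min(log_ratio * adv, clipped * adv)
        # Non-negative estimate of how far the new policy has drifted.
        drift = math.exp(log_ratio) - 1.0 - log_ratio
        total += surrogate - alpha * drift
    return total / len(logp_new)
```

With identical old and new log-probabilities the objective is zero; once the log-ratio leaves the clip range in the direction favored by the advantage, the surrogate term stops growing, while the drift penalty keeps increasing with the size of the update.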

Code & Models

Repositories

modalminds/mm-eureka (PyTorch, official)

Models

Zkkkai/CPGD-7B (Hugging Face model · 10 downloads · 1 like)

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training