Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
Zhenpeng Su, Leiyu Pan, Minxuan Lv, Tiehua Mei, Zijia Lin, Yuntao Li, Wenping Hu, Ruiming Tang, Kun Gai, Guorui Zhou

TL;DR
This paper introduces Entropy Ratio Clipping (ERC), a global constraint for reinforcement learning that stabilizes policy updates by bounding the relative change in policy entropy between the previous and current policies. The authors report improved performance across benchmarks.
Contribution
The paper proposes ERC, an entropy-ratio-based clipping method that addresses global distributional shift in RL and improves the stability and performance of existing algorithms.
Findings
ERC improves stability in RL training.
ERC enhances performance across multiple benchmarks.
ERC effectively regulates policy exploration changes.
Abstract
Large language model post-training relies on reinforcement learning to improve model capability and alignment quality. However, the off-policy training paradigm introduces distribution shift, which often pushes the policy beyond the trust region, leading to training instabilities manifested as fluctuations in policy entropy and unstable gradients. Although PPO-Clip mitigates this issue through importance clipping, it still overlooks the global distributional shift of actions. To address these challenges, we propose using the entropy ratio between the current and previous policies as a new global metric that effectively quantifies the relative change in policy exploration throughout updates. Building on this metric, we introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global…
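The entropy-ratio idea in the abstract can be illustrated with a minimal sketch. The abstract gives no formula or thresholds, so everything here is an assumption: the ratio is taken per token as the entropy of the current policy divided by the entropy of the previous policy, the bounds `low` and `high` are hypothetical values, and tokens whose ratio falls outside the bounds are simply masked out. The function names are illustrative, not the authors' implementation.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) along the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def entropy_ratio_mask(new_logits, old_logits, low=0.95, high=1.05, eps=1e-8):
    """Return the per-token entropy ratio and a keep-mask.

    Assumed form: ratio = H(current policy) / H(previous policy).
    `low` and `high` are hypothetical bounds, not values from the paper.
    """
    h_new = token_entropy(new_logits)
    h_old = token_entropy(old_logits)
    ratio = h_new / (h_old + eps)
    # Bidirectional constraint: drop tokens whose entropy shrank or grew too much.
    keep = (ratio >= low) & (ratio <= high)
    return ratio, keep

# Three token positions over a 4-way vocabulary.
old = np.array([[0.0, 0.0, 0.0, 0.0],   # uniform
                [0.0, 0.0, 0.0, 0.0],   # uniform
                [2.0, 0.0, 0.0, 0.0]])  # mildly peaked
new = np.array([[0.0, 0.0, 0.0, 0.0],   # unchanged: ratio near 1
                [10.0, 0.0, 0.0, 0.0],  # entropy collapses: ratio far below 1
                [0.0, 0.0, 0.0, 0.0]])  # entropy grows: ratio above 1
ratio, keep = entropy_ratio_mask(new, old)
```

In a training loop, a mask like `keep` would multiply the per-token policy-gradient loss alongside the usual PPO-Clip term, so that an update is suppressed when exploration either collapses or expands sharply. Whether ERC masks tokens or truncates gradients in some other way is not stated in the text above.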
