Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao

TL;DR
This paper introduces a novel dynamic entropy control mechanism in RLVR that uses gradient-preserving clipping to prevent entropy collapse, improving LLM reasoning and output diversity.
Contribution
It connects gradient-preserving clipping with entropy regulation, proposing a dynamic clipping threshold method for precise entropy control in RLVR.
Findings
Dynamic clipping thresholds effectively prevent entropy collapse.
The proposed dynamic strategies outperform static mitigation methods on benchmarks.
Theoretical and empirical analysis links importance sampling ratios to entropy changes.
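The two quantities these findings relate, policy entropy and the importance sampling ratio, can be made concrete with a small sketch. This is illustrative only and not code from the paper: the function names (`softmax`, `entropy`, `importance_ratio`) and the toy logits are assumptions, and a real RLVR setup would compute these per token over a full vocabulary.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def importance_ratio(new_logprob, old_logprob):
    """Ratio pi_new(token) / pi_old(token) for a sampled token."""
    return math.exp(new_logprob - old_logprob)

# Toy example: a uniform policy versus a sharply peaked one.
uniform = softmax([0.0, 0.0, 0.0, 0.0])
peaked = softmax([8.0, 0.0, 0.0, 0.0])
print(entropy(uniform), entropy(peaked))

# A token whose probability doubled under the updated policy.
print(importance_ratio(math.log(0.5), math.log(0.25)))
```

The peaked distribution has much lower entropy than the uniform one, which is the overconfident regime that entropy collapse refers to. A ratio above 1 means the update made the sampled token more likely; a ratio below 1 means it made the token less likely.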
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel…
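To illustrate how a clipping threshold can act as an entropy control knob, the sketch below pairs a standard PPO-style clipped surrogate with a simple feedback rule on the upper threshold. It is a minimal sketch under stated assumptions, not the method proposed in the paper: `clipped_surrogate`, `update_eps_high`, the target entropy, the step size, and the bounds are all hypothetical. The direction of the adjustment (widen the upper bound when entropy is too low) follows the common "clip-higher" heuristic and is an assumption here, not a result taken from the paper.

```python
def clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """PPO-style per-token objective with asymmetric clipping bounds."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic choice between the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)

def update_eps_high(eps_high, current_entropy, target_entropy,
                    step=0.01, lo=0.1, hi=0.4):
    """Hypothetical feedback rule: nudge the upper clip threshold
    toward keeping entropy near a target, within [lo, hi]."""
    if current_entropy < target_entropy:
        eps_high += step   # entropy too low: allow larger upward ratio moves
    elif current_entropy > target_entropy:
        eps_high -= step   # entropy too high: tighten the bound
    return min(max(eps_high, lo), hi)

# Positive advantage, ratio above the upper bound: the objective is capped.
print(clipped_surrogate(1.5, 1.0, eps_low=0.2, eps_high=0.2))

# Entropy below target: the threshold widens by one step.
print(update_eps_high(0.2, current_entropy=0.1, target_entropy=0.5))
```

With a static threshold, the first call always caps the objective at the same point regardless of how training is going. Making the threshold a function of measured entropy is the kind of dynamic control the abstract contrasts with static mitigation strategies.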
