The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding

TL;DR
This paper investigates entropy collapse in reinforcement learning for reasoning language models, establishes an empirical relationship between policy entropy and performance, and proposes techniques that control entropy to improve exploration and performance.
Contribution
It uncovers the entropy-performance relationship in RL for LLMs, explains the entropy dynamics mechanism, and introduces methods to prevent entropy collapse, enhancing exploration.
Findings
Policy performance is gained at the expense of policy entropy, so it has a predictable ceiling once entropy is exhausted.
The covariance between an action's probability and the change in its logit drives entropy dynamics.
Proposed techniques effectively prevent entropy collapse and improve downstream results.
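The covariance finding can be checked numerically in the simplest setting. The following is a minimal sketch, assuming a tabular softmax policy over a handful of actions and a small logit update; the helper names, the random logits, and the step size are illustrative choices, not taken from the paper. It uses the log-probability form of the relationship: to first order, the entropy change equals the negative covariance, under the current policy, between an action's log-probability and the change in its logit.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def entropy(p):
    """Shannon entropy (in nats) of a strictly positive distribution."""
    return float(-np.sum(p * np.log(p)))

def cov_under(p, x, y):
    """Covariance of x and y when the action is drawn from p."""
    return float(np.sum(p * x * y) - np.sum(p * x) * np.sum(p * y))

rng = np.random.default_rng(0)
z = rng.normal(size=8)           # hypothetical logits for 8 actions
dz = 1e-4 * rng.normal(size=8)   # small hypothetical logit update

p = softmax(z)
actual = entropy(softmax(z + dz)) - entropy(p)   # exact entropy change
predicted = -cov_under(p, np.log(p), dz)         # first-order estimate

# Raising only the most likely action's logit gives a positive covariance,
# so the first-order estimate of the entropy change is negative.
dz_peak = np.zeros(8)
dz_peak[np.argmax(p)] = 1e-4
predicted_peak = -cov_under(p, np.log(p), dz_peak)

print(f"actual dH = {actual:.3e}, first-order estimate = {predicted:.3e}")
print(f"estimated dH when boosting the top action = {predicted_peak:.3e}")
```

The agreement is only first-order: it degrades as the logit update grows, and the sketch says nothing about how well the relationship carries over from a tabular policy to a large Transformer.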
Abstract
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across extensive RL runs without entropy intervention: policy entropy drops sharply in the early stage of training, and this diminished exploratory ability is always accompanied by the saturation of policy performance. In practice, we establish a transformation equation R = -a*e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is traded from policy entropy and is thus bottlenecked by its exhaustion, and that the ceiling is fully predictable: at H = 0, R = -a + b. Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that, the…
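To make the fitted law in the abstract concrete, here is a minimal sketch of how the coefficients a and b in R = -a*e^H + b could be estimated and the ceiling at H = 0 read off. Because the law is linear in e^H, an ordinary least-squares line fit is enough. The function names and the synthetic (H, R) pairs are assumptions for illustration only; they are not the authors' code or measurements.

```python
import numpy as np

def fit_entropy_law(entropy, performance):
    """Least-squares fit of R = -a * exp(H) + b; returns (a, b)."""
    x = np.exp(np.asarray(entropy, dtype=float))
    y = np.asarray(performance, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # y ~ slope * x + intercept
    return -slope, intercept

def predicted_ceiling(a, b):
    """Performance the law predicts when entropy is exhausted (H = 0)."""
    return -a + b

# Synthetic, noise-free data generated from made-up coefficients.
true_a, true_b = 0.2, 0.7
H = np.linspace(1.0, 0.05, 20)       # entropy falling over training
R = -true_a * np.exp(H) + true_b     # performance implied by the law

a, b = fit_entropy_law(H, R)
ceiling = predicted_ceiling(a, b)
print(f"a = {a:.3f}, b = {b:.3f}, predicted ceiling = {ceiling:.3f}")
```

With real training logs the points would be noisy, so the fit quality should be inspected before trusting the extrapolated ceiling.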
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
+ Quality
  - Strengths
    - The authors seem to offer a novel theoretical perspective on the entropy collapse phenomenon. The technical proofs are mostly correct, except for an issue, discussed below, in results the authors have borrowed.
    - The resulting covariance-based regularization scheme studied empirically by the authors seems like a simple, reasonable approach to try in light of the presented theoretical findings.
  - Weaknesses
    * Major
      - It seems quite underwhelming that the exponent
Please see above.
i. The paper presents a potentially impactful analysis of the **"entropy collapse"** problem in LLM-RL. The "performance-entropy" exponential relationship and the "covariance-driven" theoretical mechanism are important scientific contributions to the field. Furthermore, the proposed Clip-Cov/KL-Cov methods are principled solutions based on this insight, and their proven effectiveness provides a new approach to solving the exploration problem in RL.
ii. The paper delves deep into why existing me
### Weaknesses:
While the paper is strong, there are several weaknesses that need to be addressed:
(1) Limitation of Theoretical Assumptions: The core theory (Lemma 1, Prop 1, Thm 1 & 2) is derived based on a "tabular softmax policy." This is a very strong simplifying assumption. The gap from a tabular setting to a large-scale Transformer is enormous, and the authors do not sufficiently discuss why this tabular-based derivation applies so well to complex function approximation (FA).
(2) Incons
* The paper is well organized and clearly written, with clean illustrations of its takeaways and empirical findings.
* The empirical findings support the two methods proposed by the authors, showing that suppressing tokens with high logit-advantage covariance performs better.
Major comments:
* I apologize for my unfamiliarity with the empirical studies. However, it strikes me that the following should be well known: for both finite- and infinite-horizon Markov decision processes, with discrete or continuous state/action spaces, if the system is fully observable and the Bellman operator satisfies standard measurability and compactness conditions, then there exists an optimal deterministic stationary policy. For example, see Puterman, Martin L. (1994). Therefore, the empirical obs
The main strength of the paper is making a good diagnosis for why performance in LLMs does not increase over time and rapidly saturates. Identifying that this is most likely due to the collapse of policy entropy is important.
The main weakness is that the empirical evidence is not proof that entropy collapse is the actual reason for the lack of performance improvement, so it remains to be seen whether this is the main reason. Although the authors connect policy entropy changes with the covariance between the log policy and the logit differences, the math is weak and subject to strong approximations. For instance, in Lemma 1, the actual policy entropy difference should include in the first term an average over the new distr
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
