The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui; Yuchen Zhang; Jiacheng Chen; Lifan Yuan; Zhi Wang; Yuxin Zuo; Haozhan Li; Yuchen Fan; Huayu Chen; Weize Chen; Zhiyuan Liu; Hao Peng; Lei Bai; Wanli Ouyang; Yu Cheng; Bowen Zhou; Ning Ding

arXiv:2505.22617·cs.LG·May 29, 2025

Open Access · 4 Reviews

TL;DR

This paper investigates entropy collapse in reinforcement learning for reasoning language models, establishes an empirical relationship between policy entropy and downstream performance, and proposes techniques that control entropy to improve exploration and performance.

Contribution

It uncovers the entropy-performance relationship in RL for LLMs, explains the entropy dynamics mechanism, and introduces methods to prevent entropy collapse, enhancing exploration.

Findings

01

Policy performance is gained at the cost of policy entropy, so it has a predictable ceiling once entropy is exhausted.

02

The covariance between an action's probability and the change in its logit drives entropy dynamics.

03

Proposed techniques effectively prevent entropy collapse and improve downstream results.
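
Finding 02 can be checked numerically on a toy example. The sketch below assumes a single-state tabular softmax policy (the setting the reviews say the theory is derived in) and uses the standard first-order identity for a softmax distribution: the entropy change is approximately the negative covariance, under the current policy, between the log-probability of each action and the change in its logit. The logits, the update `delta`, and all function names are illustrative assumptions, not values or code from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs)

def neg_covariance(probs, delta):
    """First-order entropy change: -Cov_{a~pi}(log pi(a), delta(a))."""
    log_p = [math.log(p) for p in probs]
    mean_log_p = sum(p * lp for p, lp in zip(probs, log_p))
    mean_delta = sum(p * d for p, d in zip(probs, delta))
    cov = sum(
        p * (lp - mean_log_p) * (d - mean_delta)
        for p, lp, d in zip(probs, log_p, delta)
    )
    return -cov

# Hypothetical toy policy and a small logit update that favours the
# already most likely action (positive covariance, so entropy should fall).
logits = [2.0, 1.0, 0.0, -1.0]
delta = [0.001, 0.0, -0.0005, 0.0]

probs = softmax(logits)
new_probs = softmax([z + d for z, d in zip(logits, delta)])

exact_change = entropy(new_probs) - entropy(probs)
first_order = neg_covariance(probs, delta)

print(f"exact entropy change:       {exact_change:.3e}")
print(f"first-order (-covariance):  {first_order:.3e}")
```

In this toy case the update raises the logit of a high-probability action, so the covariance is positive and entropy decreases; an update that favoured low-probability actions would push entropy up instead.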

Abstract

This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across a large number of RL runs without entropy intervention: policy entropy drops sharply early in training, and the resulting loss of exploratory ability is always accompanied by the saturation of policy performance. In practice, we establish a transformation equation R = -a*e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is traded for policy entropy and is therefore bottlenecked by its exhaustion, with a fully predictable ceiling: at H = 0, R = -a + b. Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that, the…
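
To make the relation R = -a*e^H + b concrete, here is a minimal sketch of how a and b could be estimated from logged (entropy, performance) pairs and how the ceiling at H = 0 follows. Because the relation is linear in e^H, ordinary least squares on x = e^H is enough. The data are synthetic and the coefficient values and function names are assumptions made for illustration; they are not results or code from the paper.

```python
import math

def fit_entropy_performance(entropies, scores):
    """Least-squares fit of R = -a * exp(H) + b; returns (a, b).

    The model is linear in x = exp(H), so this is simple linear
    regression of R on x with slope -a and intercept b.
    """
    xs = [math.exp(h) for h in entropies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, intercept

def predicted_ceiling(a, b):
    """Performance when entropy is exhausted: H = 0 gives R = -a + b."""
    return -a + b

# Synthetic, noise-free data generated from assumed coefficients.
true_a, true_b = 0.2, 0.7
entropies = [1.0, 0.8, 0.6, 0.4, 0.2, 0.1]
scores = [-true_a * math.exp(h) + true_b for h in entropies]

a, b = fit_entropy_performance(entropies, scores)
ceiling = predicted_ceiling(a, b)
print(f"a = {a:.3f}, b = {b:.3f}, predicted ceiling = {ceiling:.3f}")
```

With real training logs the points would be noisy, so the fitted ceiling would be an estimate rather than an exact value.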

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

Quality
- Strengths
  - The authors seem to offer a novel theoretical perspective on the entropy collapse phenomenon. The technical proofs are mostly correct, except for an issue in results the authors have borrowed, discussed below.
  - The resulting covariance-based regularization scheme studied empirically by the authors seems like a simple, reasonable approach to try in light of the presented theoretical findings.
- Weaknesses
  * Major
    - It seems quite underwhelming that the exponent

Weaknesses

Please see above.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

i. The paper presents a potentially impactful analysis of the "entropy collapse" problem in LLM-RL. The "performance-entropy" exponential relationship and the "covariance-driven" theoretical mechanism are important scientific contributions to the field. Furthermore, the proposed Clip-Cov/KL-Cov methods are principled solutions based on this insight, and their proven effectiveness provides a new approach to solving the exploration problem in RL.
ii. The paper delves deep into why existing me

Weaknesses

While the paper is strong, there are several weaknesses that need to be addressed:
(1) Limitation of Theoretical Assumptions: The core theory (Lemma 1, Prop 1, Thm 1 & 2) is derived based on a "tabular softmax policy." This is a very strong simplifying assumption. The gap from a tabular setting to a large-scale Transformer is enormous, and the authors do not sufficiently discuss why this tabular-based derivation applies so well to complex function approximation (FA).
(2) Incons

Reviewer 03 · Rating 4 · Confidence 3

Strengths

* The paper is well organized and clearly written, with clean, effective illustrations of its takeaways and empirical findings.
* The empirical findings support the two methods proposed by the authors, showing that suppressing tokens with high logit-advantage covariance performs better.

Weaknesses

Major comments: * I apologize for my unfamiliarity with the empirical studies. However, it strikes me that the following should be well known: for both finite- and infinite-horizon Markov decision processes (with discrete or continuous state/action spaces), if the system is fully observable and the Bellman operator satisfies standard measurability and compactness conditions, then there exists an optimal deterministic stationary policy. For example, see Puterman, Martin L. (1994). Therefore, the empirical obs

Reviewer 04 · Rating 2 · Confidence 3

Strengths

The main strength of the paper is its diagnosis of why LLM performance under RL saturates rapidly instead of continuing to improve. Identifying that this is most likely due to the collapse of policy entropy is important.

Weaknesses

The main weakness is that the empirical evidence does not prove that entropy collapse is the actual reason for the lack of performance improvement, so it remains to be seen whether this is the main cause. Although the authors connect policy entropy changes with the covariance between the log policy and the logit differences, the math is weak and relies on strong approximations. For instance, in Lemma 1, the actual policy entropy difference should include in the first term an average over the new distr

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training