Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
Huimin Xu, Shuai Zhao, Xiaobao Wu, Anh Tuan Luu

TL;DR
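This paper analyzes entropy collapse in reinforcement learning with verifiable rewards (RLVR) for language models, introduces a token-level entropy flow perspective, and proposes On-Policy Entropy Flow Optimization (OPEFO) to balance entropy dynamics, improving training stability and final performance.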
Contribution
It offers a novel token-level entropy flow analysis and introduces OPEFO, an adaptive on-policy method to prevent entropy collapse in RLVR.
Findings
OPEFO stabilizes training across six reasoning benchmarks.
OPEFO improves final performance over existing methods.
Entropy flow imbalance causes entropy collapse in RLVR.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms, such as GRPO, often suffer from entropy collapse, leading to premature determinism and unstable optimization. Existing remedies, including entropy regularization and ratio-based clipping heuristics, either control entropy in a coarse-grained manner or rely on approximate on-policy training. In this paper, we revisit entropy collapse from a token-level entropy flow perspective. Our analysis reveals that entropy-decreasing tokens consistently outweigh entropy-increasing ones, resulting in a severely imbalanced entropy flow. This perspective provides a unified explanation of entropy collapse in existing RLVR algorithms and highlights the importance of balancing entropy dynamics. Motivated by this analysis,…
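The abstract's idea of comparing entropy-increasing and entropy-decreasing tokens can be illustrated with a small sketch. This is not the paper's formulation, which the truncated abstract does not give. It assumes that "entropy flow" can be approximated by the change in each token position's predictive entropy between two policy snapshots. The function names `softmax`, `entropy`, and `entropy_flow`, and the toy logits, are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_flow(logits_before, logits_after):
    """Sum per-position entropy changes into inflow (increases) and outflow (decreases).

    Assumption for illustration only: flow is measured as the entropy
    difference at each token position between two policy snapshots.
    """
    inflow = 0.0
    outflow = 0.0
    for before, after in zip(logits_before, logits_after):
        delta = entropy(softmax(after)) - entropy(softmax(before))
        if delta > 0:
            inflow += delta
        else:
            outflow += -delta
    return inflow, outflow

# Toy data: three token positions over a three-word vocabulary.
# Two positions become sharper after the update and one becomes flatter.
logits_before = [[1.0, 1.0, 1.0], [2.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
logits_after = [[3.0, 0.0, 0.0], [4.0, 1.0, 0.0], [0.0, 0.0, 2.0]]

inflow, outflow = entropy_flow(logits_before, logits_after)
print(f"inflow={inflow:.3f} outflow={outflow:.3f}")
```

In this toy case the outflow exceeds the inflow, which mirrors the kind of imbalance the abstract describes. A method that balances the two would aim to keep them closer together.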
