Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

Renren Jin; Pengzhi Gao; Yuqi Ren; Zhuowen Han; Tongxuan Zhang; Wuwei Huang; Wei Liu; Jian Luan; Deyi Xiong

arXiv:2511.05993·cs.CL·April 21, 2026

TL;DR

This paper investigates entropy dynamics in reinforcement learning with verifiable rewards (RLVR) for large language models, identifies the key factors that influence entropy collapse, and proposes a reweighting method, Positive-Advantage Reweighting, to mitigate it.

Contribution

The paper provides a comprehensive analysis of entropy behavior in RLVR for LLMs and introduces Positive-Advantage Reweighting to control entropy during training.
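
As a rough illustration of what reweighting positive-advantage tokens could look like, the sketch below scales the advantages of tokens with positive advantage by a single coefficient before they enter a plain policy-gradient loss. This page does not give the paper's exact formulation, so the function names, the `positive_weight` parameter, and the form of the loss are all assumptions made for illustration.

```python
def reweight_advantages(advantages, positive_weight=0.5):
    """Scale positive advantages by `positive_weight`; leave the rest unchanged.

    `positive_weight` is a hypothetical knob, not a value from the paper.
    """
    return [a * positive_weight if a > 0 else a for a in advantages]


def policy_gradient_loss(log_probs, advantages, positive_weight=0.5):
    """Mean negative (log-prob * reweighted advantage) over tokens.

    This is a plain REINFORCE-style loss used only to show where the
    reweighting would enter; it is not the paper's objective.
    """
    weighted = reweight_advantages(advantages, positive_weight)
    total = sum(lp * a for lp, a in zip(log_probs, weighted))
    return -total / len(log_probs)
```

With `positive_weight` below 1, tokens with positive advantage contribute less to each update, which is one plausible way to slow the entropy decline they are reported to drive.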

Findings

1. Entropy collapse correlates with response diversity, calibration, and performance.

2. Clipping thresholds, the number of off-policy updates, and data diversity significantly affect entropy.

3. Tokens with positive advantages are the primary drivers of entropy collapse.
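
For readers unfamiliar with the quantity being tracked, the following minimal sketch computes the Shannon entropy of a next-token distribution from raw logits and averages it over the tokens of a response. It uses only the standard library; the function names are illustrative, and the paper may measure or aggregate entropy differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the distribution implied by `logits`."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(logits_per_token):
    """Average token-level entropy over one response."""
    return sum(token_entropy(l) for l in logits_per_token) / len(logits_per_token)
```

A uniform distribution over the vocabulary gives the maximum entropy, and a distribution concentrated on one token gives entropy near zero; entropy collapse refers to the training-time drift toward the latter.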

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a prominent paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, the entropy of LLMs usually collapses during RLVR training, leading to premature convergence to suboptimal local minima and hindering further performance improvement. Although various approaches have been proposed to mitigate entropy collapse, a comprehensive study of entropy in RLVR remains lacking. To bridge this gap, we conduct extensive experiments to investigate the entropy dynamics of LLMs trained with RLVR and analyze how model entropy correlates with response diversity, calibration, and performance across various benchmarks. Our results identify three key factors that influence entropy: the clipping thresholds in the optimization objective, the number of off-policy updates, and the diversity of the training…
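
To make the role of the clipping thresholds concrete, here is a minimal per-token sketch of a PPO-style clipped surrogate with separate lower and upper thresholds. The abstract does not state the exact objective, so the use of this particular form, along with the names `eps_low` and `eps_high` and their default values, is an assumption.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped objective for a single token.

    ratio:     new-policy probability divided by old-policy probability.
    advantage: estimated advantage of the sampled token.
    eps_low / eps_high: hypothetical lower and upper clipping thresholds.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

In this form, a larger `eps_high` lets the probability of a positive-advantage token grow further before its gradient is cut off, while `eps_low` bounds how far the probability of a negative-advantage token can be pushed down.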
