CURE: Critical-Token-Guided Re-Concatenation for Entropy-Collapse Prevention
Qingbin Li, Rongkun Xue, Jie Wang, Ming Zhou, Zhi Li, Xiaofeng Ji, Yongqi Wang, Miao Liu, Zheming Yang, Minghui Qiu, Jing Yang

TL;DR
CURE introduces a two-stage framework that balances exploration and exploitation in reinforcement learning for large language models, preventing entropy collapse and improving performance on math reasoning tasks.
Contribution
The paper proposes a two-stage method that re-generates at high-entropy critical tokens to maintain entropy and enhance exploration, leading to state-of-the-art results on math reasoning benchmarks.
Findings
Achieves a 5% performance improvement over existing RLVR methods.
Maintains higher entropy levels during training, promoting exploration.
Outperforms baseline methods on six math benchmarks.
Abstract
Recent advances in Reinforcement Learning with Verified Reward (RLVR) have driven the emergence of more sophisticated cognitive behaviors in large language models (LLMs), thereby enhancing their reasoning capabilities. However, in prior RLVR pipelines, the repeated use of static initial-state sampling drawn exactly from the dataset distribution during each sampling phase produced overly deterministic, low-diversity model behavior, which manifested as rapid entropy collapse and hindered sustained performance gains during prolonged training. To address this issue, we introduce CURE (Critical-token-gUided Re-concatenation for Entropy-collapse prevention), a two-stage framework that balances exploration and exploitation. Specifically, in the first stage, to deliberately steer the model toward novel yet coherent contexts, we re-generate at high-entropy critical tokens and jointly optimize…
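The idea of re-generating at high-entropy critical tokens can be illustrated with a small sketch. It is not the authors' implementation: it assumes per-step logits are available for a sampled response, treats the single step with the highest softmax entropy as the "critical token", and builds a new context from the prompt plus the response tokens before that step. The function names (token_entropy, critical_index, reconcat_prompt) and the argmax selection rule are hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def critical_index(step_logits: Sequence[Sequence[float]]) -> int:
    """Index of the generation step whose next-token distribution has the
    highest entropy (assumption: one critical token per response)."""
    entropies = [token_entropy(row) for row in step_logits]
    return max(range(len(entropies)), key=entropies.__getitem__)


def reconcat_prompt(
    prompt: List[int],
    response: List[int],
    step_logits: Sequence[Sequence[float]],
) -> List[int]:
    """Build a new context: the prompt followed by the response tokens that
    precede the critical step. Generation would restart from this context."""
    k = critical_index(step_logits)
    return prompt + response[:k]


if __name__ == "__main__":
    # Toy example: three generation steps over a vocabulary of size 3.
    logits = [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
    print(reconcat_prompt([1, 2], [7, 8, 9], logits))
```

In the toy example the second step has a uniform distribution, so it is the most uncertain one; the new context keeps the prompt and only the first response token, and sampling would continue from there.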
