ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models
Song Yu, Li Li, Wenwen Zhao, Zhisheng Yang

TL;DR
ERPO introduces token-level entropy regulation to improve reasoning in large language models, addressing limitations of sequence-level advantage methods like GRPO.
Contribution
This paper proposes ERPO, a token-level policy optimization method that uses entropy-aware gating and normalization to enhance exploration and reasoning quality.
Findings
ERPO outperforms GRPO on mathematical benchmarks.
ERPO produces more concise and robust reasoning paths.
ERPO achieves performance comparable to that of larger models.
Abstract
Reinforcement learning from verifiable rewards has significantly advanced the reasoning capabilities of large language models. However, Group Relative Policy Optimization (GRPO) typically assigns a uniform, sequence-level advantage to all tokens, thereby overlooking the intrinsic information heterogeneity along reasoning chains. We show that this coarse-grained credit assignment leads to premature entropy collapse and encourages the model to generate redundant, low-quality reasoning paths. Through systematic empirical analysis, we identify Critical Decision Pivots (CDPs): transient high-entropy states where the policy's trajectory is most sensitive to perturbations. These pivots represent the "forks in the road" where effective multi-path exploration is most crucial yet often suppressed by uniform advantage signals. Building on these insights, we propose Entropy-Regulated Policy…
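To make the credit-assignment issue in the abstract concrete, here is a minimal sketch of the two quantities it contrasts: a GRPO-style group-relative advantage broadcast uniformly to every token, and a per-token entropy that singles out a high-entropy "fork" step. This is not ERPO's gating or normalization, which the truncated abstract does not specify. The function names, the toy rewards and probabilities, and the use of the population standard deviation are all illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: one scalar per sampled response.

    Assumption: population std with a small eps; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Sequence-level credit assignment: every token gets the same value."""
    return [advantage] * num_tokens


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical toy group: four sampled responses with verifiable 0/1 rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = grpo_advantages(rewards)

# The first response has three tokens; all receive the identical advantage.
token_advs = broadcast_to_tokens(advs[0], num_tokens=3)

# Hypothetical next-token distributions at each of those three steps.
step_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.90, 0.05, 0.03, 0.02],  # fairly confident continuation
]
entropies = [token_entropy(p) for p in step_probs]

# Index of the highest-entropy step in this toy sequence.
pivot = max(range(len(entropies)), key=entropies.__getitem__)

print(token_advs)
print(entropies, pivot)
```

In this toy sequence the uniform step carries far more entropy than its neighbors, yet all three tokens receive the same advantage. That mismatch is the one the abstract attributes to sequence-level credit assignment.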
