ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models

Song Yu; Li Li; Wenwen Zhao; Zhisheng Yang

arXiv:2603.28204 · cs.LG · April 6, 2026

TL;DR

ERPO regulates entropy at the token level to improve reasoning in large language models, addressing a limitation of methods such as GRPO, which assign one sequence-level advantage to every token.

Contribution

The paper proposes ERPO, a token-level policy optimization method that uses entropy-aware gating and normalization to improve exploration and reasoning quality.

Findings

01. ERPO outperforms GRPO on mathematical benchmarks.
02. ERPO produces more concise and robust reasoning paths.
03. ERPO achieves comparable performance to larger models.

Abstract

Reinforcement learning from verifiable rewards has significantly advanced the reasoning capabilities of large language models. However, Group Relative Policy Optimization (GRPO) typically assigns a uniform, sequence-level advantage to all tokens, thereby overlooking the intrinsic information heterogeneity along reasoning chains. We show that this coarse-grained credit assignment leads to premature entropy collapse and encourages the model to generate redundant, low-quality reasoning paths. Through systematic empirical analysis, we identify Critical Decision Pivots (CDPs): transient high-entropy states where the policy's trajectory is most sensitive to perturbations. These pivots represent the "forks in the road" where effective multi-path exploration is most crucial yet often suppressed by uniform advantage signals. Building on these insights, we propose Entropy-Regulated Policy…
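
To make the contrast in the abstract concrete, the sketch below shows the two quantities it discusses: a group-relative advantage that is copied unchanged to every token of a response, and the per-token entropy that would flag a high-entropy position. It is not ERPO itself; the abstract is cut off before the method is described. The function names, the use of the population standard deviation, the `eps` term, and the entropy threshold are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standardize each reward within its sampled group (one value per response).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Uniform credit assignment: every token receives the same sequence-level value."""
    return [advantage] * num_tokens


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_distributions, threshold):
    """Indices of steps whose entropy exceeds a hypothetical threshold."""
    return [
        i
        for i, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold
    ]


# Four sampled responses to one prompt, scored by a verifiable 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)

# Every token of the first response gets the same advantage,
# whether the policy was confident or uncertain at that step.
per_token = broadcast_to_tokens(advantages[0], num_tokens=3)

# Toy next-token distributions for the three steps of that response.
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.90, 0.10, 0.00, 0.00],  # fairly confident step
]
print(high_entropy_positions(steps, threshold=1.0))  # [1]
```

In this toy example only the second step is flagged as high-entropy, yet under uniform broadcasting it receives exactly the same advantage as its neighbors. That is the coarse-grained credit assignment the abstract argues against.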
