EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance
Song Yu, Li Li, Wenwen Zhao, Zhisheng Yang

TL;DR
EP-GRPO is a reinforcement learning framework that augments GRPO with entropy-gated modulation and implicit process signals mined from the model's intrinsic information flow, improving reasoning accuracy and training efficiency.
Contribution
The paper systematically quantifies three credit assignment failures in GRPO and proposes EP-GRPO, which uses entropy gating and implicit process signals to provide dense, self-supervised policy guidance.
Findings
EP-GRPO outperforms GRPO on mathematical reasoning benchmarks.
It achieves higher accuracy and training efficiency than GRPO.
The method maintains gradient flow under zero reward variance.
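The zero-variance case in the last finding can be illustrated with the standard group-relative advantage used by GRPO, where each sampled response's reward is normalized by the group mean and standard deviation. The following is a minimal sketch and not the authors' code: the function name `group_relative_advantages`, the `eps` value, and the choice of population standard deviation are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style advantages).

    Assumption: population standard deviation and a small eps for stability;
    implementations differ on both details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages of both signs.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response gets the same reward yields all-zero
# advantages, so the outcome reward contributes no policy gradient.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

When every response in a group is correct (or every response is wrong), all advantages are zero and the group is effectively wasted for outcome-driven learning, which is the situation a dense, reward-independent signal is meant to cover.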
Abstract
Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that penalizes correct steps and rewards incorrect ones, and zero-variance collapse that erases outcome-driven gradients. We systematically quantify these failures, revealing highly non-uniform token informativeness, widespread step-level polarity misalignment, and substantial training waste. To address these limitations, we propose Entropy-Progress Aligned GRPO (EP-GRPO), a framework that mines the model's intrinsic information flow for dense, self-supervised guidance. EP-GRPO integrates entropy-gated modulation to prioritize high-entropy decision pivots, implicit process signals from policy…
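To make "high-entropy decision pivots" concrete, the sketch below computes the Shannon entropy of the next-token distribution at each position and turns it into a per-token weight. The entropy computation is standard. The gating rule (`entropy_gates`, which divides by the largest entropy in the sequence) is a hypothetical stand-in chosen for illustration, since the summary above does not give the paper's actual modulation formula.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_gates(entropies):
    """Hypothetical gate: scale each entropy by the sequence maximum.

    Assumption for illustration only; not the gating rule from the paper.
    """
    peak = max(entropies)
    if peak == 0.0:
        return [0.0 for _ in entropies]
    return [h / peak for h in entropies]


# Toy sequence over a 4-token vocabulary: one confident position,
# one maximally uncertain position (a "pivot"), one in between.
sequence_logits = [
    [10.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
]
entropies = [token_entropy(row) for row in sequence_logits]
gates = entropy_gates(entropies)
print(gates)
```

Under this toy rule, the uncertain position receives the largest weight and the confident position a weight near zero, which matches the idea of concentrating credit on tokens where the policy is actually making a decision.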
