TL;DR
EDGE-GRPO introduces an entropy-driven approach with guided error correction to improve diversity and mitigate advantage collapse in reinforcement learning for large language models, showing superior performance on reasoning benchmarks.
Contribution
The paper proposes the EDGE-GRPO algorithm, combining entropy-driven advantage and guided error correction to address advantage collapse in policy optimization.
Findings
Effective mitigation of advantage collapse.
Improved response diversity in LLMs.
Superior performance on reasoning benchmarks.
Abstract
Large Language Models (LLMs) have made remarkable progress in enhancing step-by-step reasoning through reinforcement learning. However, the Group Relative Policy Optimization (GRPO) algorithm, which relies on sparse reward rules, often encounters the issue of identical rewards within groups, leading to the advantage collapse problem. Existing works typically address this challenge from two perspectives: enforcing model reflection to enhance response diversity, and introducing internal feedback to augment the training signal (advantage). In this work, we begin by analyzing the limitations of model reflection and investigating the policy entropy of responses at the fine-grained sample level. Based on our experimental findings, we propose the EDGE-GRPO algorithm, which adopts Entropy-Driven Advantage and Guided Error Correction to effectively mitigate…
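The advantage collapse described in the abstract can be illustrated with the group-wise reward standardization commonly used in GRPO. The following is a minimal sketch, assuming binary rule-based rewards and the usual (reward minus group mean) divided by group standard deviation form; the function name `group_relative_advantages` and the `eps` value are illustrative choices, not taken from the paper, and the sketch does not implement EDGE-GRPO itself.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: the standard GRPO-style normalization; `eps` is an
    illustrative constant that avoids division by zero.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


# A group with mixed outcomes yields a usable training signal:
# correct responses get positive advantages, incorrect ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Groups with identical rewards collapse: every advantage is zero,
# so these samples contribute no gradient signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed, all_wrong, all_right)
```

With sparse rewards, hard prompts tend to produce all-incorrect groups and easy prompts all-correct ones, which is why the abstract treats identical in-group rewards as the source of the problem.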
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The illustration and writing of the method are clear and easy to follow.
Major Weakness: * I’m doubtful about the fairness of the experimental setup. Since reference solutions are used, the proposed method benefits from additional distilled knowledge compared to other reinforcement learning methods such as Dr.GRPO, DAPO, and vanilla GRPO. I believe a fairer baseline would be SFT+GRPO. From Tables 1 and 2, it appears that the proposed method without reference solutions performs more or less on par with the other RL methods. Minor Weakness: * In terms of writing, I fo
1. The authors conduct a comprehensive study on the helpfulness of self-reflection and the role of entropy in both correct and incorrect responses within LLM reasoning tasks. 2. The paper proposes two strategies to handle cases where all responses are incorrect during RL rollouts and introduces an entropy-based augmentation to better differentiate advantages within a group.
The motivation behind the proposed method is sound; however, the experimental settings and design exhibit several deficiencies. 1. Unreasonable experimental setup: The experimental settings are not sufficiently rigorous. Training an RLVR model with only 1K problems for a single epoch is far from convergence, as evidenced by prior works [1,2]. Moreover, the authors train their model on a challenging dataset but restrict the maximum response length to 1,024 tokens, which may significantly limit p
1. Focuses on an important question: advantage collapse in GRPO. 2. Contains extensive empirical evaluation across multiple benchmarks. 3. High practical relevance due to increasing community interest in small-data RL.
1. The motivation for EDA relies on Figure 3, which claims that models often assign low entropy (high confidence) to incorrect responses. However, the paper computes entropy comparisons globally across all responses, not within each question, which is the only setting relevant for GRPO’s intra-group advantage ranking. Prior work [1] shows that LLMs generally assign higher confidence to the correct answer within each question. Therefore, a global statistic cannot support the claim that GRPO’s pe
- Some empirical findings are interesting, e.g., incorrect rollouts are also likely to be deterministic for the model, with low entropy. - Evaluation is validated over different model setups, across Qwen and Llama, and with diverse benchmarks. - The RLVR topic this paper tries to address is timely and important.
- The clarity of the writing in format, notation, and math needs to be greatly improved. See the first series of bullet points in my Questions section for details. - The evaluation benchmark is problematic and lacks clarity. Specifically, (i) why does AMC have 83 questions? The standard AMC23 benchmark used by the community has only 40 questions; (ii) the reported baseline performance is lower than in previous work: taking Qwen2.5-Math-7B as an example, this work reports 53.40, while previous publish
