Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin

TL;DR
This paper introduces Stepwise Guided Policy Optimization (SGPO), a method that improves reinforcement learning for language models by addressing the all-negative-sample issue in GRPO through response diversity, enhancing reasoning capabilities.
Contribution
The paper proposes a novel diversification framework using a step-wise judge model to mitigate all-negative-sample problems in GRPO, with theoretical and empirical validation across multiple model sizes and benchmarks.
Findings
SGPO accelerates learning dynamics in simplified settings.
SGPO improves average performance on reasoning benchmarks.
Effectiveness depends on negative sample structure and informativeness.
Abstract
Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training recent reasoning models, but it fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation highlights a gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these failure signals. We introduce a simple framework to mitigate the all-negative-sample issue by incorporating response diversity within groups using a step-wise judge model, which can be trained directly or adapted from existing LLMs. In a simplified setting, we prove that this diversification accelerates GRPO's learning dynamics. We then empirically validate Stepwise Guided…
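The all-negative-sample issue described above can be illustrated with a small sketch. It assumes the common GRPO-style advantage (each reward standardized against its group mean and standard deviation) and a purely hypothetical partial-credit rule: the fraction of steps a judge accepts before the first error, scaled to stay below the reward of a correct answer. The function names, the `scale` parameter, and the judge verdicts are illustrative assumptions, not the paper's actual reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def stepwise_partial_reward(step_verdicts, final_correct, scale=0.5):
    """Hypothetical partial credit for one response (not the paper's formula).

    step_verdicts: list of booleans from a step-wise judge, one per step.
    Returns 1.0 for a correct answer; otherwise the fraction of steps
    before the first judged error, scaled by `scale`.
    """
    if final_correct:
        return 1.0
    if not step_verdicts:
        return 0.0
    first_error = next(
        (i for i, ok in enumerate(step_verdicts) if not ok),
        len(step_verdicts),
    )
    return scale * first_error / len(step_verdicts)


# A group of four responses, all with incorrect final answers.
# The verdicts are made-up judge outputs for illustration.
judged_group = [
    [True, True, False, False],
    [False, False, False],
    [True, False],
    [True, True, True, False],
]

# Binary outcome rewards: every response gets 0, so every advantage is 0.
binary_rewards = [0.0 for _ in judged_group]
binary_adv = group_relative_advantages(binary_rewards)

# Step-wise rewards: the responses differ, so the advantages are non-zero.
shaped_rewards = [stepwise_partial_reward(v, final_correct=False) for v in judged_group]
shaped_adv = group_relative_advantages(shaped_rewards)

print("binary advantages:", binary_adv)
print("shaped rewards:   ", shaped_rewards)
print("shaped advantages:", shaped_adv)
```

With binary rewards the group carries no gradient signal, because every advantage is zero. Once the rewards differ within the group, responses that stay correct for longer receive positive advantages and the rest receive negative ones.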
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper's problem setting is highly reasonable and well-motivated. The “all-negative-sample” or “zero-reward” stagnation issue in GRPO is a genuine and recognized limitation.
- The idea of learning from mistakes by assigning partial credit rather than treating all failures as a uniform r=0 (maybe a waste of information) seems logically sound.
- The authors present a thorough empirical validation plan. The experiments cover nine different benchmarks, multiple model sizes (7B, 14B, 32B), o
- The quantitative results in Table 2, the paper's main empirical contribution, are not compelling. The overall performance gains are marginal at best and appear to diminish as model scale increases.
- The paper claims weaker judges do not significantly degrade outcomes but fails to support this. The experiments primarily use SOTA judges (O4-mini) or massively larger judges (Qwen-235B) to achieve gains on smaller models
1. The all-negative-sample group issue in GRPO is a real limitation, especially in early training when model capabilities are weak.
2. Comprehensive experiments: the evaluation is thorough, covering multiple model sizes (7B, 14B, 32B); both offline and online training settings; 9 diverse benchmarks including math reasoning tasks; base models and distilled variants; and multiple judge models (both closed and open-source).
1. What is the definition of a "step"? Is it a reasoning step or a sentence?
2. SGPO relies on judge models to determine whether a step is correct, and the judge models are significantly more capable than the policy models. The paper uses DeepSeek-V3-0324, Qwen3-235B-A22B, and QwQ-32B as judge models. This raises two issues: 1) Why not directly use the judge model to answer these questions, since the judge model's performance is significantly better than the policy model's? 2) What if
1. Identifies a real limitation of GRPO: zero advantage in all-negative-sample groups.
2. Simple idea (assign partial credit to negative samples) that is intuitively reasonable.
1. SGPO relies on step-wise comparison against a ground-truth reasoning trajectory to identify the first error. This dependence on reference solutions limits applicability, as many real-world reasoning tasks do not provide step-level gold reasoning, making the method far less general than claimed.
2. The method presumes that model outputs contain clean, well-segmented “steps” that a judge can reliably evaluate. In practice, many models produce unstructured or interleaved reasoning, making step
1. **Motivation is clear and relevant.** The all-negative-sample problem in GRPO is a genuine issue in RL-based LLM training. Framing it as “learning from mistakes” is intuitive and connects well with human learning analogies.
2. **Simple and practical idea.** SGPO is conceptually straightforward and can be easily integrated into existing GRPO pipelines without large architectural changes.
3. **Comprehensive experiments.** The paper covers several model sizes and judge types, and demonstrates co
1. **Theoretical analysis limited to toy settings.** The 2-step reasoning example, while pedagogical, is too simplistic to capture realistic dynamics of multi-token reasoning or large-group sampling. There is no empirical verification of the claimed acceleration trend in real GRPO trajectories (e.g., convergence rate plots or gradient magnitude analysis).
2. **Conceptual overlap and unclear novelty boundary.** The idea of assigning *step-level rewards* via a verifier/judge overlaps with
Taxonomy
Topics: Complex Systems and Decision Making
Methods: Balanced Selection
