Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning
Benteng Chen, Weida Wang, Shufei Zhang, Mingbao Lin, Min Zhang

TL;DR
Step-GRPO is a post-training framework that internalizes dynamic early-exit capabilities into large reasoning models, reducing redundant computation while maintaining high accuracy.
Contribution
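It introduces a post-training method that optimizes over semantic reasoning steps, delimited by linguistic markers, rather than raw tokens. The method pairs a Dynamic Truncated Rollout mechanism with a Step-Aware Relative Reward that penalizes redundancy relative to a group-level baseline.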
Findings
Reduces token consumption by 32% on Qwen3-8B.
Achieves a better accuracy-efficiency trade-off across multiple benchmarks.
Outperforms traditional length-penalty methods in efficiency without accuracy loss.
Abstract
Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early-exit adds system overhead. To bridge this gap, we propose Step-GRPO, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps by utilizing linguistic markers to structure reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise high-confidence trajectories during exploration, synergized with a Step-Aware Relative Reward that dynamically penalizes redundancy based on group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a…
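The abstract does not give the reward formula, so the following is only a rough sketch of how a step-aware, group-relative reward could be computed, not the authors' implementation. The marker list `STEP_MARKERS`, the `penalty` coefficient, the choice of the mean step count of correct rollouts as the baseline, and all function names are assumptions made for illustration. The final normalization is the standard group-relative advantage used in GRPO.

```python
import re
from statistics import mean, pstdev

# Hypothetical marker list: the abstract does not enumerate the markers it uses.
STEP_MARKERS = ("Wait", "Alternatively", "Hmm", "But", "So")
_MARKER_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, STEP_MARKERS)) + r")\b")


def count_steps(text: str) -> int:
    """Count semantic steps: one initial step plus one per marker occurrence."""
    if not text.strip():
        return 0
    return 1 + len(_MARKER_RE.findall(text))


def step_aware_rewards(rollouts, correct, penalty=0.1):
    """Reward correct rollouts, discounting steps beyond the group baseline.

    Assumed form: the baseline is the mean step count of the correct
    rollouts in the group; incorrect rollouts receive zero reward.
    """
    steps = [count_steps(r) for r in rollouts]
    correct_steps = [s for s, ok in zip(steps, correct) if ok]
    baseline = mean(correct_steps) if correct_steps else 0.0
    rewards = []
    for s, ok in zip(steps, correct):
        if not ok:
            rewards.append(0.0)
            continue
        # Relative excess over the baseline; never rewards going below it.
        excess = max(0.0, s - baseline) / max(baseline, 1.0)
        rewards.append(1.0 - penalty * excess)
    return rewards


def group_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization: (r - group mean) / group std."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


rollouts = [
    "Compute 2+2=4.",
    "Compute 2+2=4. Wait, check again. Alternatively, recount. So it is 4.",
    "Compute 2+2=5.",
]
correct = [True, True, False]
rewards = step_aware_rewards(rollouts, correct)
advantages = group_advantages(rewards)
```

In this toy group, the concise correct rollout keeps the full reward, the correct but redundant rollout is discounted slightly, and the incorrect rollout gets nothing, so the advantages rank the three in that order.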
