Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning

Benteng Chen; Weida Wang; Shufei Zhang; Mingbao Lin; Min Zhang

arXiv:2604.16890·cs.AI·April 21, 2026

TL;DR

Step-GRPO is a post-training framework that internalizes dynamic early-exit capabilities into large reasoning models, reducing redundant computation while maintaining high accuracy.

Contribution

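Step-GRPO shifts the optimization objective from raw tokens to semantic steps, which it delimits using linguistic markers. It pairs a Dynamic Truncated Rollout mechanism, which exposes the model to concise high-confidence trajectories during exploration, with a Step-Aware Relative Reward that penalizes redundancy relative to group-level baselines.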

Findings

01. Reduces token consumption by 32% on Qwen3-8B.

02. Achieves a better accuracy-efficiency trade-off across multiple benchmarks.

03. Outperforms traditional length-penalty methods in efficiency without accuracy loss.

Abstract

Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early exit adds system overhead. To bridge this gap, we propose Step-GRPO, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps by using linguistic markers to structure reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise high-confidence trajectories during exploration, combined with a Step-Aware Relative Reward that dynamically penalizes redundancy based on group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a…
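To make the idea of a step-level, group-relative penalty concrete, here is a minimal sketch. It is not the paper's formulation: the marker list `STEP_MARKERS`, the `penalty` coefficient, the reward shape, and all function names are assumptions chosen for illustration. It counts steps by splitting on linguistic markers, penalizes correct responses that use more steps than the group's correct-response average, and then applies the standard GRPO-style normalization of rewards within the group.

```python
import re
from statistics import mean, pstdev

# Hypothetical marker list; the markers actually used by Step-GRPO
# are not given in this summary.
STEP_MARKERS = ("Wait", "Alternatively", "Hmm", "So", "Therefore")
_MARKER_RE = re.compile(r"\b(?:" + "|".join(STEP_MARKERS) + r")\b")


def count_steps(text):
    """Count semantic steps as 1 + the number of marker occurrences."""
    return 1 + len(_MARKER_RE.findall(text))


def step_aware_rewards(responses, correct, penalty=0.5):
    """Assumed reward shape: 0 for wrong answers; for correct answers,
    1 minus a capped penalty on steps above the group baseline."""
    steps = [count_steps(r) for r in responses]
    correct_steps = [s for s, ok in zip(steps, correct) if ok]
    if not correct_steps:
        return [0.0] * len(responses)
    # Group-level baseline: mean step count over correct responses.
    baseline = mean(correct_steps)
    rewards = []
    for s, ok in zip(steps, correct):
        if not ok:
            rewards.append(0.0)
            continue
        excess = max(0.0, (s - baseline) / baseline)
        # Cap the penalty so a correct answer always beats a wrong one.
        rewards.append(1.0 - penalty * min(excess, 1.0))
    return rewards


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


group = [
    "Compute 2+2=4.",
    "Compute 2+2=4. Wait, check again. Wait, recheck. So 4.",
    "Compute 2+2=5.",
]
is_correct = [True, True, False]
rewards = step_aware_rewards(group, is_correct)
advantages = group_relative_advantages(rewards)
```

In this toy group, the concise correct response receives the highest advantage, the verbose correct response a lower one, and the incorrect response the lowest. Capping the penalty below 1 is a design choice in this sketch so that brevity is never rewarded over correctness.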