S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models

Muzhi Dai; Chenxu Yang; Qingyi Si

arXiv:2505.07686·cs.AI·May 20, 2025

TL;DR

This paper introduces S-GRPO, a reinforcement learning method that trains reasoning models to judge when their intermediate reasoning is sufficient and to exit chain-of-thought generation early, reducing redundancy and improving accuracy.

Contribution

S-GRPO is a novel serial sampling reinforcement learning paradigm that promotes early exit in reasoning models, enhancing efficiency and correctness.

Findings

01. Reduces reasoning sequence length by up to 61.1%.

02. Improves accuracy by up to 6.08% across benchmarks.

03. Compatible with models like Qwen3 and Deepseek-distill.

Abstract

As Test-Time Scaling emerges as an active research focus in the large language model community, advanced post-training methods increasingly emphasize extending chain-of-thought (CoT) generation length, thereby enhancing reasoning capabilities to approach Deepseek R1-like reasoning models. However, recent studies reveal that reasoning models (even Qwen3) consistently exhibit excessive thought redundancy in CoT generation. This overthinking issue arises from the inherent limitations of conventional outcome-reward reinforcement learning, which systematically overlooks the regulation of intermediate reasoning processes. This paper introduces Serial-Group Decaying-Reward Policy Optimization (S-GRPO), a novel reinforcement learning paradigm that enables models to implicitly evaluate the sufficiency of intermediate reasoning steps, thereby facilitating early exit in CoT generation. Unlike…
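The abstract as shown here is cut off before it spells out the reward rule, so the following is only a minimal sketch of what a "serial group" with a "decaying reward" could look like. It assumes one chain of thought is truncated at several exit points (earliest first), that each exit yields an answer marked correct or incorrect, and that the k-th correct answer receives a reward of 1 / 2^(k-1) while incorrect answers receive 0. The halving schedule, the mean-centered advantage, and the function names `decaying_rewards` and `group_advantages` are assumptions chosen for illustration, not details taken from the text above.

```python
from typing import List


def decaying_rewards(correct: List[bool]) -> List[float]:
    """Reward answers from successive early-exit points of ONE chain of thought.

    correct[i] says whether the answer produced at the i-th exit (earliest
    first) is right. Assumed rule: the k-th correct answer gets 1 / 2**(k - 1),
    so earlier correct exits earn more; incorrect answers get 0.
    """
    rewards = []
    n_correct = 0  # running count of correct answers seen so far
    for ok in correct:
        if ok:
            n_correct += 1
            rewards.append(1.0 / 2 ** (n_correct - 1))
        else:
            rewards.append(0.0)
    return rewards


def group_advantages(rewards: List[float]) -> List[float]:
    """Center rewards on the group mean (an assumed, simplified advantage)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    # Five exit points along one reasoning trace; exits 2, 3 and 5 answer correctly.
    flags = [False, True, True, False, True]
    r = decaying_rewards(flags)
    print(r)
    print(group_advantages(r))
```

Under these assumptions the earliest correct exit gets the largest advantage, which is the pressure toward shorter reasoning that the abstract describes; a wrong early exit still scores below any correct later one, so correctness is not traded away for brevity.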

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
