Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

Dehai Min; Giovanni Vaccarino; Huiyi Chen; Yongliang Wu; Gal Yona; Lu Cheng

arXiv:2605.17672·cs.CL·May 19, 2026

TL;DR

This paper introduces PUMA, a framework that uses reasoning-level semantic redundancy to enable early exits in reasoning models, reducing tokens and latency without sacrificing accuracy.

Contribution

The paper proposes reasoning-level semantic redundancy as a signal for early stopping in reasoning models and develops PUMA, a plug-and-play framework that combines redundancy detection with verification.

Findings

01 PUMA reduces token usage by 26.2% on average across benchmarks.

02 Semantic redundancy is an effective indicator of reasoning convergence.

03 PUMA maintains accuracy while enabling early exits across a variety of tasks.

Abstract

Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the…
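The redundancy idea in the abstract can be illustrated with a toy sketch. This is not the PUMA implementation: it assumes a bag-of-words cosine similarity as a stand-in for a learned redundancy detector, it omits the verification stage entirely, and the names `find_exit_step`, `threshold`, and `patience` are hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(step: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", step.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_exit_step(steps, threshold=0.8, patience=2):
    """Return the index of the last step to keep, or None if no exit fires.

    A step counts as redundant when it is highly similar to some earlier
    step. The exit fires after `patience` consecutive redundant steps, and
    the chain is cut just before that redundant run began.
    (threshold and patience are illustrative values, not from the paper.)
    """
    seen = []
    run = 0
    for i, step in enumerate(steps):
        vec = bag_of_words(step)
        redundant = any(cosine(vec, prev) >= threshold for prev in seen)
        run = run + 1 if redundant else 0
        seen.append(vec)
        if run >= patience:
            return i - patience
    return None


steps = [
    "Let x be the number of apples so 2x + 3 = 11",
    "Subtract 3 from both sides to get 2x = 8",
    "Divide by 2 to get x = 4",
    "Wait, let me check: subtract 3 from both sides to get 2x = 8",
    "So divide by 2 to get x = 4",
]
last_kept = find_exit_step(steps)
print(last_kept)
```

In this example the last two steps only restate earlier conclusions, so the sketch keeps the chain through the step that first derives x = 4. Requiring several consecutive redundant steps is one simple way to avoid exiting on a single restatement while the model is still exploring or self-correcting.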

Code & Models

Repositories

giovanni-vaccarino/PUMA
github

Models

🤗
ZhishanQ/qwen3-embedding-redundancy-detector-0.6B
model· 51 dl

Datasets

ZhishanQ/puma-rd-training-data
dataset· 17 dl
