Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
Dehai Min, Giovanni Vaccarino, Huiyi Chen, Yongliang Wu, Gal Yona, Lu Cheng

TL;DR
This paper introduces PUMA, a framework that uses reasoning-level semantic redundancy to enable early exits in reasoning models, reducing tokens and latency without sacrificing accuracy.
Contribution
It proposes a novel semantic redundancy signal for early stopping in reasoning models and develops PUMA, a plug-and-play framework combining redundancy detection and verification.
Findings
PUMA reduces token usage by 26.2% on average across benchmarks.
Semantic redundancy effectively indicates reasoning convergence.
PUMA maintains accuracy while enabling early exits across various tasks.
Abstract
Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the…
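The idea of stopping once successive steps only revisit established conclusions can be illustrated with a toy sketch. This is not PUMA's detector: it substitutes bag-of-words cosine similarity for a real semantic comparison, omits the verification stage entirely, and the names `find_exit_step`, `threshold`, and `patience` are hypothetical choices made for this illustration only.

```python
import math
import re
from collections import Counter


def bag_of_words(step: str) -> Counter:
    """Lowercased token counts; a crude lexical stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z0-9]+", step.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_exit_step(steps, threshold=0.8, patience=2):
    """Return how many leading steps to keep before exiting.

    A step counts as redundant when it is at least `threshold`-similar to
    some earlier non-redundant step. The exit fires after `patience`
    consecutive redundant steps (assumed values, not taken from the paper).
    Returns len(steps) if the exit never fires.
    """
    seen = []           # vectors of steps that contributed something new
    redundant_run = 0   # length of the current streak of redundant steps
    for i, step in enumerate(steps):
        vec = bag_of_words(step)
        max_sim = max((cosine(vec, prev) for prev in seen), default=0.0)
        if max_sim >= threshold:
            redundant_run += 1
            if redundant_run >= patience:
                return i + 1
        else:
            redundant_run = 0
            seen.append(vec)
    return len(steps)


steps = [
    "Let x be 3 so x plus 4 equals 7",
    "Check the parity of 7 which is odd",
    "So x plus 4 equals 7 let x be 3",       # restates step 1
    "The parity of 7 is odd check which",    # restates step 2
    "Therefore the answer is 7",
]
kept = find_exit_step(steps)
```

Requiring several consecutive redundant steps, rather than exiting on the first one, is one simple way to avoid cutting off a chain that briefly restates a result before continuing to explore or self-correct.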
