SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation
Gengyang Li, Wang Cai, Yifeng Gao, Yunfang Wu

TL;DR
SyncThink is a training-free decoding method that stops chain-of-thought reasoning early by monitoring the model's own reasoning-transition signal. This reduces inference cost while matching or improving accuracy on reasoning tasks.
Contribution
It introduces a novel, training-free approach to align inference termination with reasoning saturation using a model's own signals, without modifying model weights.
Findings
Matches full CoT accuracy (62.00% vs. 61.22% average Top-1) while generating 656 instead of 2141 tokens on average.
Reduces inference time from 92.01 s to 28.68 s on average.
Improves accuracy by up to +8.1 (absolute) on long-horizon tasks such as GPQA.
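The relative savings follow directly from the averages reported in the abstract (656 vs. 2141 tokens, 28.68 s vs. 92.01 s, 62.00% vs. 61.22% Top-1). The short calculation below makes them explicit. The variable names are illustrative only.

```python
# Averages reported in the abstract (three DeepSeek-R1 distilled models,
# GSM8K / MMLU / GPQA / BBH).
full_cot = {"accuracy": 61.22, "tokens": 2141, "latency_s": 92.01}
syncthink = {"accuracy": 62.00, "tokens": 656, "latency_s": 28.68}

# Fraction of generated tokens and wall-clock time saved relative to full CoT.
token_reduction = 1 - syncthink["tokens"] / full_cot["tokens"]
latency_reduction = 1 - syncthink["latency_s"] / full_cot["latency_s"]
speedup = full_cot["latency_s"] / syncthink["latency_s"]
# Absolute difference in average Top-1 accuracy (percentage points).
accuracy_gain = syncthink["accuracy"] - full_cot["accuracy"]

print(f"tokens: -{token_reduction:.1%}")          # tokens: -69.4%
print(f"latency: -{latency_reduction:.1%}")       # latency: -68.8%
print(f"speedup: {speedup:.2f}x")                 # speedup: 3.21x
print(f"accuracy: {accuracy_gain:+.2f} points")   # accuracy: +0.78 points
```

In other words, the reported averages correspond to roughly a 3.2x reduction in both generated tokens and latency, with a slight gain in average accuracy.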
Abstract
Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token "</think>", indicating an information bottleneck. Building on this observation, SyncThink monitors the model's own reasoning-transition signal and terminates reasoning. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy using 656 generated tokens and 28.68 s latency, compared to 61.22 percent, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 absolute…
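The abstract describes the mechanism only at a high level. The following is a minimal, hypothetical sketch of early exit from the reasoning phase. It assumes that the "reasoning-transition signal" is the probability the model assigns to the end-of-thinking token `</think>`, and that reasoning stops once this probability reaches a threshold `tau`. The function `decode_with_early_exit`, the threshold, the token budgets, and `toy_model` (a stand-in for a real LLM) are all illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Dict, List

END_THINK = "</think>"
EOS = "<eos>"

# Hypothetical model interface: maps the current token sequence to a
# next-token probability distribution (a real LLM would supply logits).
NextTokenFn = Callable[[List[str]], Dict[str, float]]


def greedy(dist: Dict[str, float]) -> str:
    """Pick the most probable next token."""
    return max(dist, key=dist.get)


def decode_with_early_exit(next_token_probs: NextTokenFn, prompt: List[str],
                           tau: float = 0.3, max_think: int = 64,
                           max_answer: int = 16) -> List[str]:
    """Greedy decoding that may end the reasoning phase early.

    Assumed rule (illustrative): stop thinking as soon as P(END_THINK) >= tau,
    even if greedy decoding would still continue reasoning.
    """
    seq = list(prompt)
    # Reasoning phase: stop on a natural END_THINK or a strong transition signal.
    for _ in range(max_think):
        dist = next_token_probs(seq)
        tok = greedy(dist)
        if tok == END_THINK or dist.get(END_THINK, 0.0) >= tau:
            break
        seq.append(tok)
    seq.append(END_THINK)  # close the reasoning segment (early, natural, or budget)
    # Answer phase: plain greedy decoding until EOS or the answer budget.
    for _ in range(max_answer):
        tok = greedy(next_token_probs(seq))
        seq.append(tok)
        if tok == EOS:
            break
    return seq


def toy_model(seq: List[str]) -> Dict[str, float]:
    """Toy stand-in: P(END_THINK) grows with the number of reasoning steps."""
    if END_THINK in seq:
        # Emit the answer once, then stop.
        return {"42": 0.9, EOS: 0.1} if seq[-1] == END_THINK else {EOS: 1.0}
    k = seq.count("step")
    p_end = min(0.9, k / 25)
    return {"step": 1.0 - p_end, END_THINK: p_end}


prompt = ["Q", "<think>"]
full = decode_with_early_exit(toy_model, prompt, tau=1.01)  # tau > 1 disables early exit
early = decode_with_early_exit(toy_model, prompt, tau=0.3)
print(full.count("step"), early.count("step"))  # 13 8
print(early[-3:])  # ['</think>', '42', '<eos>']
```

In this toy setting, greedy decoding alone produces 13 reasoning steps before choosing `</think>`, while the threshold rule exits after 8 steps and still reaches the same answer. Whether and how SyncThink thresholds this signal in practice is not specified in this summary.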
