ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy
Gengyang Li, Yifeng Gao, Yuming Li, Yunfang Wu

TL;DR
ThinkLess is a training-free method that reduces reasoning redundancy in large language models by terminating reasoning generation early, cutting decoding time and KV cache memory use while maintaining answer quality.
Contribution
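It introduces an early-termination approach grounded in attention analysis: because answer tokens attend mainly to the reasoning terminator token rather than to earlier reasoning steps, the terminator can be inserted at an earlier position to skip redundant reasoning, with no model fine-tuning.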
Findings
Achieves comparable accuracy to full-length CoT decoding.
Reduces decoding time and memory consumption significantly.
Operates without fine-tuning or auxiliary data.
Abstract
While Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), the excessive length of reasoning tokens increases latency and KV cache memory usage, and may even truncate final answers under context limits. We propose ThinkLess, an inference-efficient framework that terminates reasoning generation early and maintains output quality without modifying the model. Attention analysis reveals that answer tokens focus minimally on earlier reasoning steps and primarily attend to the reasoning terminator token, due to information migration under causal masking. Building on this insight, ThinkLess inserts the terminator token at earlier positions to skip redundant reasoning while preserving the underlying knowledge transfer. To prevent format disruption caused by early termination, ThinkLess employs a lightweight post-regulation mechanism, relying on the model's…
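The early-termination idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token strings `<think>`, `</think>`, and `<eos>`, the `reasoning_budget` parameter, and the `toy_next_token` stand-in model are all assumptions made for illustration, and the post-regulation mechanism is not covered because the excerpt above is cut off before describing it. The sketch only shows the control flow of closing the reasoning span early and then decoding the answer.

```python
from typing import Callable, List, Tuple

# Hypothetical token strings; the source does not name the actual tokens.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"
EOS = "<eos>"


def decode_with_early_termination(
    prompt: List[str],
    next_token: Callable[[List[str]], str],
    reasoning_budget: int,
    answer_budget: int = 64,
) -> Tuple[List[str], List[str]]:
    """Generate at most `reasoning_budget` reasoning tokens, force the
    terminator token, then decode the answer until EOS or `answer_budget`."""
    tokens = list(prompt) + [THINK_OPEN]

    # Phase 1: bounded reasoning. Stop early if the model closes the span itself.
    for _ in range(reasoning_budget):
        tok = next_token(tokens)
        if tok == THINK_CLOSE:
            break
        tokens.append(tok)

    # Insert the terminator at this earlier position instead of waiting for
    # the model to emit it after a full-length chain of thought.
    tokens.append(THINK_CLOSE)

    # Phase 2: decode the answer conditioned on the shortened context.
    answer: List[str] = []
    for _ in range(answer_budget):
        tok = next_token(tokens + answer)
        if tok == EOS:
            break
        answer.append(tok)
    return tokens, answer


def toy_next_token(tokens: List[str]) -> str:
    """Stand-in for a real model: reasons forever until the terminator
    appears, then emits a one-token answer followed by EOS."""
    if THINK_CLOSE not in tokens:
        return "step"
    return "42" if tokens[-1] == THINK_CLOSE else EOS


if __name__ == "__main__":
    context, answer = decode_with_early_termination(
        ["Q"], toy_next_token, reasoning_budget=2
    )
    print(context, answer)
```

With a real model, `next_token` would wrap a single decoding step, and a smaller `reasoning_budget` would shorten the context that the KV cache has to hold during answer generation.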
Taxonomy
Topics: Semantic Web and Ontologies · Intelligent Tutoring Systems and Adaptive Learning
Methods: Focus
