The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models
Siqi Fan, Bowen Qin, Peng Han, Shuo Shang, Yequan Wang, Aixin Sun

TL;DR
This paper evaluates reasoning efficiency in large language models, finds that instruct models are generally more efficient than thinking models, and proposes COTHINK, a two-stage pipeline that reduces token usage while maintaining accuracy.
Contribution
It formalizes reasoning efficiency as a relative measure, systematically compares models, and introduces COTHINK, a two-stage pipeline that improves efficiency without sacrificing accuracy.
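To make the idea of a relative measure concrete, here is a minimal sketch. It is not the paper's definition, which this summary does not state. It assumes efficiency is accuracy per output token and reports the thinking model's value as a ratio to the instruct baseline. The names `RunStats`, `accuracy_per_token`, and `relative_efficiency`, and all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    accuracy: float    # fraction of problems solved, in [0, 1]
    avg_tokens: float  # mean number of output tokens per problem


def accuracy_per_token(stats: RunStats) -> float:
    """Assumed efficiency proxy: accuracy divided by mean output tokens."""
    if stats.avg_tokens <= 0:
        raise ValueError("avg_tokens must be positive")
    return stats.accuracy / stats.avg_tokens


def relative_efficiency(thinking: RunStats, instruct: RunStats) -> float:
    """Ratio of the thinking model's efficiency to the instruct baseline.

    A value below 1 means the thinking model buys less accuracy per token
    than the minimal-effort baseline.
    """
    baseline = accuracy_per_token(instruct)
    if baseline == 0:
        raise ValueError("the instruct baseline has zero accuracy")
    return accuracy_per_token(thinking) / baseline


# Made-up numbers, for illustration only.
instruct = RunStats(accuracy=0.60, avg_tokens=500)
thinking = RunStats(accuracy=0.80, avg_tokens=4000)
ratio = relative_efficiency(thinking, instruct)
```

With these invented numbers the thinking model is more accurate but spends eight times as many tokens, so its relative efficiency falls well below 1. Computing the same ratio per difficulty bucket would be one way to examine the difficulty effect listed under Findings.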
Findings
Instruct models are generally more efficient than thinking models.
Problem difficulty influences the efficiency of reasoning models.
COTHINK reduces token usage by 21.1% on benchmark datasets.
Abstract
Recent thinking models trained with reinforcement learning and backward-checking CoT often suffer from overthinking: they produce excessively long outputs even on simple problems, wasting computation. Existing evaluations, based on token efficiency, give an incomplete view as they neglect problem difficulty and intermediate computation costs. We formalize reasoning efficiency as a relative measure between thinking and instruct models, treating instruct models as the minimal-effort baseline. A systematic study across four thinking models and multiple benchmarks reveals two consistent patterns: (i) instruct models achieve higher efficiency overall, and (ii) problem difficulty affects efficiency, with thinking models wasting computation on easy problems but providing value on harder ones. Building on this insight, we propose COTHINK, a simple two-stage pipeline: an instruct model drafts a…
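The two-stage pipeline described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the `Generate` interface, and the stub models are all hypothetical, and the only grounded part is the structure (an instruct model drafts a brief outline, then a thinking model expands it).

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to generated text.
Generate = Callable[[str], str]


def cothink(problem: str, instruct_generate: Generate, thinking_generate: Generate) -> str:
    """Two-stage sketch: the instruct model drafts an outline, the thinking model expands it."""
    # Stage 1: ask the instruct model for a short outline (prompt wording is an assumption).
    outline = instruct_generate(
        f"Write a brief outline of how to solve this problem.\n\nProblem: {problem}"
    )
    # Stage 2: hand the problem plus the outline to the thinking model.
    return thinking_generate(
        f"Problem: {problem}\n\nOutline: {outline}\n\n"
        "Follow the outline and give the final answer."
    )


# Stub models so the sketch runs without any model backend.
def stub_instruct(prompt: str) -> str:
    return "1) set up the equation 2) solve for x"


def stub_thinking(prompt: str) -> str:
    return f"[expanded from {len(prompt)} prompt characters] x = 4"


answer = cothink("Solve 2x + 1 = 9.", stub_instruct, stub_thinking)
```

In practice the two callables would wrap real model endpoints. Keeping them as plain functions means the same sketch works with any backend and makes it easy to count tokens at each stage.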
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
CoThink is simple and delivers performance gains for some models on some tasks. On average, models do gain in performance across some tasks. This suggests that depending on the decision making process, it may be worth trying this approach for some applications.
The improvement in performance isn’t that clear-cut. Considered alongside the method's relative simplicity (which may also read as limited novelty), that is either a pro or a con. I am personally inclined to forgive simple methods more for inconsistent performance gains. It would be nice to see some comparison to, or discussion of, methods that force early stopping, such as [1,2]. Limited motivation of design choices such as prompts. No discussion of how well the models follow instructions/conform to the o
1. The paper introduces an interesting metric that enables consistent comparison across models and tasks. This quantitative perspective fills a gap in evaluating reasoning models beyond simple accuracy or token count. 2. The proposed COTHINK framework is simple and practical. It requires no difficulty prediction, is easy to reproduce, and achieves meaningful compute savings without sacrificing performance. 3. The paper is clearly written and well-organized. Motivation, method, and analysis are c
1. While the current experiments focus exclusively on mathematical reasoning, extending the evaluation to at least one non-math reasoning domain (e.g., code generation on HumanEval or MBPP, or knowledge reasoning on GPQA-Diamond or the non-math subset of MMLU-Pro) would strengthen the paper’s generality and demonstrate the broader applicability of the proposed framework. 2. The robustness of the two-stage structure could be explored further. It would be helpful if the author could include at lea
1. A clean formalization of relative efficiency. 2. CoThink works without architectural changes, using only two-stage prompt engineering; it is simple and effective.
1. This topic has already been widely and deeply studied, and the paper does not provide new insights or surprising results. 2. The mechanistic explanations, including RL-induced verbosity and backward CoT patterns, are speculative and lack rigorous evidence. 3. Lines 192-194 claim RL reduces "per-step information density" but provide no direct evidence. 4. The authors try to establish a scaling law, which is well intentioned, but how are the parameters fit? The scaling parameters in the Figure are simply
1. The paper is well motivated overall. Through visual analyses such as Figure 1 and Figure 2, it illustrates the overthinking phenomenon and its strong correlation with task difficulty, providing a solid foundation for the proposed efficiency metric. 2. The proposed method is reasonable: a two-stage pipeline, COTHINK, uses an instruct model to draft a brief outline and a thinking model to expand it. 3. Experimental results demonstrate the effectiveness of the prop
1. The motivating findings are similar to those of existing studies. The two main observations in Section 2.1 (that instruct models are more efficient and reasoning models mainly help on hard problems) have already been reported in multiple prior works (e.g., AutoThink [1], Chen et al. [2], Sui et al. [3], Wang et al. [4]). These studies also show similar reasoning efficiency distributions across problem difficulty and input length. 2. The novelty is somewhat limited. Similar pipeline-based approaches
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Dialogue-Adaptive Pre-training Objective
