THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models
Zhiyuan Li, Yi Chang, Yuan Wu

TL;DR
This paper introduces Think-Bench, a benchmark for assessing the reasoning efficiency of large reasoning models (LRMs). It highlights how prevalent overthinking is and proposes efficiency metrics that expose wasted computation.
Contribution
It presents a systematic benchmark and novel metrics to evaluate and analyze the reasoning efficiency and chain-of-thought quality of large reasoning models.
Findings
Most LRMs overthink easy questions, producing unnecessarily long reasoning chains.
Many LRMs have high chain-of-thought quality but low efficiency.
Overthinking leads to significant computational resource waste.
Abstract
Large reasoning models (LRMs) have achieved impressive performance in complex tasks, often outperforming conventional large language models (LLMs). However, the prevalent issue of overthinking severely limits their computational efficiency. Overthinking occurs when models generate excessive and redundant tokens that contribute little to accurate outcomes, especially in simple tasks, resulting in a significant waste of computational resources. To systematically investigate this issue, we introduce Think-Bench, a benchmark designed to evaluate the reasoning efficiency of LRMs. We also propose novel efficiency metrics and conduct a comprehensive evaluation of various LRMs across multiple dimensions, including the reasoning process, outcome quality, and chain-of-thought (CoT) characteristics. Our analysis reveals that most LRMs exhibit overthinking in handling easy questions, generating…
Peer Reviews
Decision·Submitted to ICLR 2026
Process-level evaluation. Goes beyond final accuracy to step-level CoT quality (recall/precision), enabled by human key-step annotations and a judging pipeline. This addresses a real gap in current benchmarks.
Clear efficiency metrics. The definitions of Efficiency (first-correct/total), Reflection Tokens, and Thought Num are intuitive and operationalizable.
Empirical findings. Consistent evidence of “overthinking,” differences across subjects and difficulty, and model-specific trade-offs are
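The metrics named in this review can be illustrated with a minimal sketch. It reads Efficiency (first-correct/total) as the number of tokens generated up to the first correct answer divided by the total number of reasoning tokens, and reads step-level recall and precision as simple fractions over annotated key steps and judged model steps. The function names, the treatment of traces that never reach a correct answer, and the set-based step matching are all assumptions made for illustration, not the paper's implementation (which relies on an LLM judge).

```python
from typing import Optional, Sequence, Set


def efficiency(first_correct_tokens: Optional[int], total_tokens: int) -> float:
    """Share of the reasoning trace produced before the first correct answer.

    A low value means the model kept generating long after it was already right.
    Assumption: a trace that never reaches the correct answer scores 0.0.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if first_correct_tokens is None:
        return 0.0
    if not 0 <= first_correct_tokens <= total_tokens:
        raise ValueError("first_correct_tokens must lie in [0, total_tokens]")
    return first_correct_tokens / total_tokens


def step_recall(key_steps: Set[str], covered_steps: Set[str]) -> float:
    """Fraction of human-annotated key steps that the model's CoT covers."""
    if not key_steps:
        return 0.0
    return len(key_steps & covered_steps) / len(key_steps)


def step_precision(step_is_correct: Sequence[bool]) -> float:
    """Fraction of the model's own reasoning steps that were judged correct."""
    if not step_is_correct:
        return 0.0
    return sum(step_is_correct) / len(step_is_correct)


if __name__ == "__main__":
    # Hypothetical trace: correct answer first appears after 120 of 480 tokens.
    print(efficiency(120, 480))
    print(step_recall({"s1", "s2", "s3", "s4"}, {"s1", "s2"}))
    print(step_precision([True, True, False, True]))
```

Under this reading, a model that is right early but keeps reflecting gets a low efficiency score even when its recall and precision are high, which matches the "high quality, low efficiency" pattern listed under Findings.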
1. Missing marquee closed models. The model list includes Claude 3.7 Sonnet, DeepSeek-R1 (+distills), Qwen3-235B, Grok-3-mini, ERNIE-X1-Turbo, and GLM-Z1-Air, but not GPT-4/5 or Gemini-2.5-Pro. For a paper about the thinking efficiency of LRMs, omitting today’s most-used LLMs weakens external validity. The authors do note that some models don’t expose CoT, but a discussion or ablation on “no-CoT models” or proxy evaluations would help.
2. Simplicity / metric specificity. While practical, some metrics (e.g., Thoug
1. The idea of evaluating LRMs on both reasoning efficiency and CoT quality is innovative, providing a new framework that looks beyond the final answer accuracy, which is a limitation of many existing benchmarks.
2. The paper performs a detailed evaluation of 11 models, including both proprietary and open-source LRMs, across various domains and difficulty levels, offering insights into current reasoning inefficiencies.
3. The proposed metrics for reasoning quality, like reflection quality, con
1. Numerous studies have already proposed solutions to address overthinking [1-3], such as dynamic reasoning path design or early-exit mechanisms. This paper merely cites these ideas as potential future directions without providing any concrete comparative analysis, resulting in a rather weak analytical foundation.
2. The definition of “efficiency” is quite ambiguous, and it is unclear which type of efficiency the paper refers to. For instance, if efficiency is defined from the user’s perspecti
S1. The problem addressed in this paper is a real challenge for large reasoning models, especially given the increasing popularity of test-time-scaling strategies. The paper focuses not only on the accuracy of the final answer but also on the efficiency and quality of the reasoning process, filling a significant gap in existing evaluation systems. The proposed six efficiency indicators and two quality indicators complement each other, forming a relatively complete eva
W1. The core assessment in this paper relies on Claude 3.7 Sonnet to determine the correctness of reasoning steps and their match with the reference answer, but the reliability of this assessor is not adequately verified. There is no report on the inter-annotator agreement between human assessment and LLM assessment, nor is there an analysis of potential systematic biases introduced by Claude. For example, Claude may have a preference for certain expression styles or reasoning paths, which could
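The agreement check this reviewer asks for could take the form of a chance-corrected agreement score between human labels and the LLM judge's labels on the same reasoning steps. The sketch below computes Cohen's kappa in plain Python; the label encoding (1 for "step correct", 0 for "step incorrect") and the example data are hypothetical, and nothing here is reported in the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa between two label sequences over the same items."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(human)

    # Observed agreement: share of items where both raters give the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Expected agreement if both raters labeled independently at their own rates.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels for four reasoning steps (1 = correct, 0 = incorrect).
    human_labels = [1, 1, 0, 0]
    judge_labels = [1, 1, 0, 1]
    print(cohens_kappa(human_labels, judge_labels))
```

Kappa is used here rather than raw agreement because a judge that labels almost every step "correct" can agree with humans often by chance alone.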
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks
