TL;DR
This paper introduces LLMThinkBench, a benchmark and empirical study analyzing the tradeoff between accuracy and verbosity in LLMs' basic math reasoning, revealing that longer reasoning does not always improve performance.
Contribution
The paper formalizes the accuracy-verbosity tradeoff, introduces the Overthinking Score, and provides a large-scale evaluation of 53 LLMs with open-source tools and benchmarks.
Findings
Strong performance on complex benchmarks does not reliably translate to basic math reasoning.
Reasoning models generate ~18x more tokens but sometimes have lower accuracy.
Extended reasoning budgets yield diminishing returns, with some models showing no accuracy gain.
Abstract
Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning while generating unnecessarily verbose responses. In this paper, we present LLMThinkBench, a systematic benchmark and comprehensive empirical study to evaluate the efficiency of reasoning in LLMs, focusing on the fundamental tradeoff between accuracy and overthinking. First, we formalize the accuracy-verbosity tradeoff. Second, we introduce the Overthinking Score, a harmonic-mean metric combining accuracy and token-efficiency for holistic model evaluation. Third, we establish an evaluation protocol with dynamically-generated data across 14 basic math tasks. Fourth, we conduct a large-scale empirical study evaluating 53 LLMs, including reasoning and quantized variants across different reasoning budgets. Fifth, we release LLMThinkBench as an open-source…
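The abstract describes the Overthinking Score only as a harmonic mean of accuracy and token efficiency, so the following is a minimal sketch of that idea rather than the paper's definition. The function names, the min-max normalization of token counts, and the example numbers are all assumptions made for illustration.

```python
def token_efficiency(tokens: float, min_tokens: float, max_tokens: float) -> float:
    """Map a token count to [0, 1]: 1.0 for the tersest response, 0.0 for the most verbose.

    ASSUMPTION: min-max normalization is a hypothetical choice; the source
    does not say how token efficiency is computed.
    """
    if max_tokens == min_tokens:
        return 1.0
    # Clip so that out-of-range counts still yield a value in [0, 1].
    clipped = min(max(tokens, min_tokens), max_tokens)
    return (max_tokens - clipped) / (max_tokens - min_tokens)


def overthinking_score(accuracy: float, efficiency: float) -> float:
    """Harmonic mean of accuracy and token efficiency, both expected in [0, 1]."""
    if accuracy + efficiency == 0:
        return 0.0
    return 2 * accuracy * efficiency / (accuracy + efficiency)


# Illustrative numbers only, not results from the paper.
eff = token_efficiency(tokens=500, min_tokens=100, max_tokens=2100)
score = overthinking_score(accuracy=0.9, efficiency=eff)
print(f"efficiency={eff:.2f}, score={score:.3f}")
```

Because a harmonic mean is pulled toward the smaller of its two inputs, a model that is accurate but very verbose (or terse but inaccurate) scores low under this sketch, which matches the stated goal of penalizing overthinking.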
