Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models

Gaurav Srivastava; Aafiya Hussain; Sriram Srinivasan; Xuan Wang

arXiv:2507.04023·cs.CL·April 24, 2026

TL;DR

This paper introduces LLMThinkBench, a benchmark and empirical study analyzing the tradeoff between accuracy and verbosity in LLMs' basic math reasoning, revealing that longer reasoning does not always improve performance.

Contribution

The paper formalizes the accuracy-verbosity tradeoff, introduces the Overthinking Score, and provides a large-scale evaluation of 53 LLMs with open-source tools and benchmarks.

Findings

01 Model performance on complex benchmarks does not translate to basic math reasoning.

02 Reasoning models generate ~18x more tokens but sometimes have lower accuracy.

03 Extended reasoning budgets yield diminishing returns, with some models showing no accuracy gain.

Abstract

Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning while generating unnecessarily verbose responses. In this paper, we present LLMThinkBench, a systematic benchmark and comprehensive empirical study to evaluate the efficiency of reasoning in LLMs, focusing on the fundamental tradeoff between accuracy and overthinking. First, we formalize the accuracy-verbosity tradeoff. Second, we introduce the Overthinking Score, a harmonic-mean metric combining accuracy and token efficiency for holistic model evaluation. Third, we establish an evaluation protocol with dynamically generated data across 14 basic math tasks. Fourth, we conduct a large-scale empirical study evaluating 53 LLMs, including reasoning and quantized variants across different reasoning budgets. Fifth, we release LLMThinkBench as an open-source…
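
The abstract describes the Overthinking Score only as a harmonic mean of accuracy and token efficiency; the exact definition of token efficiency is not given in this listing. The following is a minimal sketch of how such a score could be computed, not the paper's formula. The function names, the linear normalization of token counts, and the `token_cap` parameter are all assumptions introduced for illustration.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two values in [0, 1]; returns 0.0 if either is non-positive."""
    if a <= 0.0 or b <= 0.0:
        return 0.0
    return 2.0 * a * b / (a + b)


def token_efficiency(mean_tokens: float, token_cap: float) -> float:
    """Hypothetical normalization (an assumption, not from the paper):
    1.0 when no tokens are used, falling linearly to 0.0 at or above token_cap."""
    if token_cap <= 0.0:
        raise ValueError("token_cap must be positive")
    return max(0.0, 1.0 - mean_tokens / token_cap)


def overthinking_score_sketch(accuracy: float, mean_tokens: float,
                              token_cap: float = 1000.0) -> float:
    """Illustrative score: harmonic mean of accuracy and the assumed token efficiency."""
    return harmonic_mean(accuracy, token_efficiency(mean_tokens, token_cap))


if __name__ == "__main__":
    # Two made-up models with equal accuracy but different verbosity.
    concise = overthinking_score_sketch(accuracy=0.9, mean_tokens=100.0)
    verbose = overthinking_score_sketch(accuracy=0.9, mean_tokens=900.0)
    print(f"concise: {concise:.3f}, verbose: {verbose:.3f}")
```

Because a harmonic mean is dominated by its smaller argument, a model with high accuracy but very long outputs scores low under this kind of metric, which matches the tradeoff the abstract describes. The numbers in the usage example are invented and do not correspond to any result in the paper.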

Code & Models

Repositories

ctrl-gaurav/LLMThinkBench (GitHub)
