Don't Think Twice! Over-Reasoning Impairs Confidence Calibration

Romain Lacombe; Kerrie Wu; Eddie Dilworth

arXiv:2508.15050·cs.AI·August 22, 2025

TL;DR

This paper shows that increasing the reasoning budget of large language models can impair confidence calibration, and that search-augmented generation substantially improves accuracy in assessing expert confidence on knowledge-intensive tasks.

Contribution

It challenges the belief that more reasoning always improves calibration, highlighting the importance of information access over reasoning depth.

Findings

01

Longer reasoning budgets lead to overconfidence and worse calibration.

02

Search-augmented generation outperforms pure reasoning, achieving 89.3% accuracy.

03

Increasing reasoning steps does not improve, and can harm, confidence calibration.

Abstract

Large Language Models deployed as question answering tools require robust calibration to avoid overconfidence. We systematically evaluate how reasoning capabilities and budget affect confidence assessment accuracy, using the ClimateX dataset (Lacombe et al., 2023) and expanding it to human and planetary health. Our key finding challenges the "test-time scaling" paradigm: while recent reasoning LLMs achieve 48.7% accuracy in assessing expert confidence, increasing reasoning budgets consistently impairs rather than improves calibration. Extended reasoning leads to systematic overconfidence that worsens with longer thinking budgets, producing diminishing and negative returns beyond modest computational investments. Conversely, search-augmented generation dramatically outperforms pure reasoning, achieving 89.3% accuracy by retrieving relevant evidence. Our results suggest that information…
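
To make the two quantities discussed in the abstract concrete, here is a minimal sketch of how accuracy and systematic overconfidence could be measured when a model is asked to predict an expert-assigned confidence label. The four-level label scale, the `evaluate` helper, and the toy data are illustrative assumptions, not the paper's evaluation code or results.

```python
from statistics import mean

# Assumed ordinal confidence scale (illustrative, not taken from the paper).
LEVELS = ["low", "medium", "high", "very high"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def evaluate(predicted, expert):
    """Return (accuracy, bias) for predicted vs. expert confidence labels.

    accuracy: fraction of statements where the predicted label matches exactly.
    bias: mean signed rank difference (predicted minus expert); a positive
          value means the model rates statements as more certain than experts.
    """
    if not predicted or len(predicted) != len(expert):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [RANK[p] - RANK[e] for p, e in zip(predicted, expert)]
    accuracy = mean(1.0 if d == 0 else 0.0 for d in diffs)
    bias = mean(diffs)
    return accuracy, bias


# Toy example with made-up labels: two exact matches, two overconfident calls.
predicted = ["high", "very high", "medium", "high"]
expert = ["medium", "very high", "medium", "low"]
accuracy, bias = evaluate(predicted, expert)
print(f"accuracy={accuracy:.2f}, bias={bias:+.2f}")
```

Under this sketch, comparing `bias` across runs with different reasoning budgets would be one way to check whether longer thinking shifts predictions toward higher confidence levels.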

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education