Does Thinking More always Help? Mirage of Test-Time Scaling in Reasoning Models

Soumya Suvra Ghosal; Souradip Chakraborty; Avinash Reddy; Yifu Lu; Mengdi Wang; Dinesh Manocha; Furong Huang; Mohammad Ghavamzadeh; Amrit Singh Bedi

arXiv:2506.04210 · cs.AI · October 24, 2025

Open Access

TL;DR

Extending reasoning traces at test-time often leads to overthinking, which increases output variance and can decrease accuracy, but a parallel thinking approach can improve performance by selecting the most consistent response.

Contribution

The paper demonstrates that test-time extended thinking can be counterproductive and introduces a parallel thinking method that improves reasoning accuracy through majority voting.

Findings

01

Extended thinking initially improves performance, but performance then declines due to overthinking.

02

Additional thinking increases output variance, creating an illusion of better reasoning.

03

Parallel thinking with majority voting outperforms extended thinking, achieving up to 20% higher accuracy.
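The third finding can be illustrated with a minimal sketch of majority voting over independently sampled answers. This is an illustration under stated assumptions, not the authors' implementation: `sample_fn` is a hypothetical callable standing in for one model call that returns a final answer string, and the tie-breaking rule (first answer seen wins) is an arbitrary choice made here.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    # Scan in original order so the tie-break is deterministic.
    for answer in answers:
        if counts[answer] == best:
            return answer


def parallel_thinking(sample_fn, prompt, n):
    """Draw n independent answers and keep the most consistent one.

    sample_fn is a hypothetical stand-in for a single model call that
    returns a final answer string for the given prompt.
    """
    return majority_vote([sample_fn(prompt) for _ in range(n)])
```

For example, `majority_vote(["42", "17", "42", "8"])` selects `"42"`, since it is the only answer that appears more than once.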

Abstract

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek R1) have led to a popular belief that extending thinking traces using prompts like "Wait" or "Let me rethink" can improve performance. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to "overthinking". To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance, creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from "more thinking" are not true indicators of improved reasoning, but artifacts stemming from the…
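The variance argument can be made concrete with a toy calculation. The sketch below is not the paper's probabilistic model; it assumes, purely for illustration, that the model's output is a one-dimensional Gaussian whose mean sits away from the correct answer, and that a response counts as correct when it lands within a tolerance of the target. The names `mean`, `target`, and `tol` and their default values are hypothetical. Under these assumptions, widening the distribution first raises the chance of hitting the target and then lowers it again.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def hit_probability(sigma, mean=0.0, target=1.0, tol=0.2):
    """P(|X - target| < tol) for X ~ Normal(mean, sigma^2).

    All parameter values are illustrative assumptions, not taken
    from the paper.
    """
    lo = (target - tol - mean) / sigma
    hi = (target + tol - mean) / sigma
    return normal_cdf(hi) - normal_cdf(lo)


# Sweep the spread: the hit probability rises, peaks, then falls.
for sigma in (0.1, 0.5, 1.0, 3.0, 10.0):
    print(f"sigma={sigma:5.1f}  P(hit)={hit_probability(sigma):.4f}")
```

In this toy setting the rise comes only from spreading probability mass over a wider range, not from the mean moving closer to the target, which mirrors the abstract's point that the early gains are an artifact of variance.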

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Mobile Crowdsensing and Crowdsourcing