Can Confidence Estimates Decide When Chain-of-Thought Is Necessary for LLMs?

Samuel Lewis-Lim; Xingwei Tan; Zhixue Zhao; Nikolaos Aletras

arXiv:2510.21007·cs.CL·January 9, 2026



Open Access · 4 Reviews

TL;DR

This paper investigates confidence-based gating to determine when large language models should use chain-of-thought reasoning, aiming to reduce unnecessary computation while maintaining reasoning accuracy.

Contribution

It introduces a systematic evaluation framework for confidence signals and compares multiple measures for selective reasoning in LLMs, providing practical guidance.

Findings

1. Confidence measures can reduce redundant reasoning
2. Performance of confidence signals varies across tasks
3. Existing measures are training-free and effective in some settings

Abstract

Chain-of-thought (CoT) prompting is a common technique for improving the reasoning abilities of large language models (LLMs). However, extended reasoning is often unnecessary and substantially increases token usage. A key question is therefore how to allocate compute to the cases where reasoning is actually needed. We study this through confidence-gated CoT, where a model first produces a direct answer together with a confidence estimate, and that estimate decides whether to invoke CoT. We present an evaluation framework together with the first systematic study of confidence signals for this decision. We evaluate four representative confidence measures and compare them with random gating and an oracle upper bound. Experiments across two model families and diverse reasoning tasks show that existing training-free confidence measures can reduce redundant reasoning. However, we also find that the utility of individual…
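The gating decision described in the abstract can be illustrated with a small offline simulation. This is a minimal sketch, not the authors' implementation: the `Example` record, the fixed threshold, the toy numbers, and the token accounting (the direct answer is always paid for, with CoT tokens added on top when the gate fires) are all assumptions made for illustration. The oracle follows the description quoted in the reviews, triggering CoT only when the direct answer is incorrect.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    """Hypothetical per-question record, assumed to be precomputed offline."""
    direct_correct: bool  # is the direct (no-CoT) answer correct?
    cot_correct: bool     # is the CoT answer correct?
    confidence: float     # confidence in the direct answer, in [0, 1]
    direct_tokens: int    # tokens spent on the direct answer
    cot_tokens: int       # extra tokens spent if CoT is invoked


Gate = Callable[[Example], bool]  # returns True when CoT should be invoked


def evaluate(examples: List[Example], use_cot: Gate) -> Tuple[float, float]:
    """Return (accuracy, mean tokens per question) for a gating policy."""
    correct = 0
    tokens = 0
    for ex in examples:
        tokens += ex.direct_tokens  # assumption: direct answer always comes first
        if use_cot(ex):
            tokens += ex.cot_tokens
            correct += int(ex.cot_correct)
        else:
            correct += int(ex.direct_correct)
    return correct / len(examples), tokens / len(examples)


def confidence_gate(threshold: float) -> Gate:
    """Invoke CoT only when confidence in the direct answer is below threshold."""
    return lambda ex: ex.confidence < threshold


def random_gate(rate: float, seed: int = 0) -> Gate:
    """Baseline: invoke CoT on a random fraction `rate` of questions."""
    rng = random.Random(seed)
    return lambda ex: rng.random() < rate


def oracle_gate() -> Gate:
    """Upper bound: invoke CoT only when the direct answer is wrong."""
    return lambda ex: not ex.direct_correct


# Toy data, invented purely for illustration.
EXAMPLES = [
    Example(True, True, 0.95, 5, 200),
    Example(True, False, 0.90, 5, 200),
    Example(False, True, 0.30, 5, 200),
    Example(False, False, 0.40, 5, 200),
]

if __name__ == "__main__":
    policies = {
        "never CoT": lambda ex: False,
        "always CoT": lambda ex: True,
        "confidence < 0.5": confidence_gate(0.5),
        "random 50%": random_gate(0.5),
        "oracle": oracle_gate(),
    }
    for name, gate in policies.items():
        acc, toks = evaluate(EXAMPLES, gate)
        print(f"{name:18s} accuracy={acc:.2f} mean_tokens={toks:.1f}")
```

On this toy data the confidence gate happens to match the oracle because the low-confidence questions are exactly the ones answered incorrectly. With a poorly calibrated signal the two would diverge, which is the kind of gap the evaluation framework is meant to measure.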

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The calibration of verbalized confidence in LLMs is a critical research topic, and this study has significant practical implications.
2. The experimentation is comprehensive and in-depth, thoroughly demonstrating that the proposed methods can simultaneously reduce token usage while preserving accuracy.
3. The proposed confidence-gating pipeline is intuitive and elegant, offering a straightforward approach to effectively address the important problem of deciding when a model should engage in

Weaknesses

1. While the paper introduces a promising research direction and evaluates several feasible methods, it lacks a unified or theoretically grounded framework, which limits its theoretical contribution.
2. In Figure 2, several confidence estimation methods underperform the random baseline. A theoretical analysis from the authors explaining why and under what conditions these methods fail would significantly strengthen the paper.
3. Some important related work for LLM confidence calibration is not

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper addresses a timely and relevant problem and is supported by solid empirical evaluations.
- The writing is clear, well-organized, and easy to follow.

Weaknesses

- **Oracle and CoT routing interpretation** In Figures 2 and 3, the oracle triggers CoT only when a direct answer is incorrect (line 286). However, the gap between “always using CoT” and the oracle is very large, suggesting that CoT severely harms performance when applied indiscriminately. This raises a key question: does the oracle use only *one round* of CoT, matching other baselines? If so, the results may indicate that *offline CoT routing itself is ineffective*, which deserves clearer discu

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper tackles a clear and practical problem, deciding when CoT reasoning is necessary through an intuitive confidence-based gating framework.
2. The experiments are systematic and broad, covering multiple models and reasoning benchmarks with solid comparisons to random and oracle baselines.
3. Results demonstrate meaningful token savings with minimal accuracy loss, showing practical value for efficient reasoning.

Weaknesses

1. I am concerned that the contribution of this paper is limited. If the goal is to propose an efficient reasoning method, then the performance is not good enough and there are many efficient CoT baselines to be compared with. If the goal is to simply analyze the confidence-gating, then the scope is way too narrow, as effectively measuring the confidence of LLM generation itself is still a problem to be solved. The insights from this paper are limited.
2. The proposed confidence signals exhibi

Reviewer 04 · Rating 2 · Confidence 4

Strengths

1. In general, the writing of the paper is clear and easy to follow.
2. The focus of the paper, on how to determine the necessity of using CoT, is meaningful and interesting.
3. Multiple tasks are considered in the experiment section, which provides extensive results on how confidence-gated CoT behaves across different reasoning types and model scales.

Weaknesses

1. A major concern is that the positioning of the paper is somewhat unclear. While it introduces a method for confidence-based control of CoT reasoning, the proposed approach fails to consistently demonstrate strong performance across models and tasks. As a result, the paper shifts its focus toward analyzing whether confidence estimates can be used to decide when CoT is needed. However, since the analysis is conducted only on this specific implementation, the conclusions are method-specific and


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques