Deep Think with Confidence
Yichao Fu, Xuewei Wang, Yuandong Tian, Jiawei Zhao

TL;DR
DeepConf improves reasoning accuracy and efficiency in large language models by dynamically filtering low-quality outputs using internal confidence signals, without extra training or tuning.
Contribution
DeepConf introduces a confidence-based filtering method that enhances reasoning performance and reduces computation in LLMs without additional training.
Findings
Achieves up to 99.9% accuracy on AIME 2025
Reduces generated tokens by up to 84.7%
Seamlessly integrates into existing frameworks
Abstract
Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025,…
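The filtering idea in the abstract can be illustrated with a minimal sketch. It assumes each reasoning trace has already been reduced to a final answer and a single scalar confidence; the function names, the `keep_fraction` parameter, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def plain_majority(traces):
    """Baseline self-consistency: every trace gets one vote."""
    return Counter(answer for answer, _ in traces).most_common(1)[0][0]


def filtered_weighted_vote(traces, keep_fraction=0.5):
    """Keep the most confident traces, then vote with confidence as weight.

    traces: list of (answer, confidence) pairs; higher confidence is better.
    keep_fraction: hypothetical knob for the share of traces retained.
    """
    if not traces:
        raise ValueError("need at least one trace")
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    votes = defaultdict(float)
    for answer, confidence in ranked[:n_keep]:
        votes[answer] += confidence
    return max(votes, key=votes.get)


# Toy data (made up): six low-confidence traces agree on "17",
# four high-confidence traces agree on "42".
traces = [
    ("17", 0.35), ("17", 0.30), ("17", 0.25),
    ("17", 0.20), ("17", 0.30), ("17", 0.25),
    ("42", 0.90), ("42", 0.85), ("42", 0.80), ("42", 0.75),
]
print(plain_majority(traces))                # 17
print(filtered_weighted_vote(traces, 0.5))   # 42
```

In this toy case the unweighted vote follows the larger but less confident group, while filtering followed by weighted voting follows the smaller, more confident one. Whether that helps on real data depends on how well confidence tracks correctness.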
Peer Reviews
Decision: ICLR 2026 Poster
1. Practical impact and simplicity: DeepConf can be deployed immediately without retraining or external calibration. It’s a drop-in improvement to any self-consistency pipeline.
2. Local vs. global insight: The authors convincingly show that global average confidence can mask local failures, while sliding-window metrics like LGC and Tail Confidence capture these weak spots better.
3. Strong empirical results: Large-scale experiments (5 datasets, 4 models, 64× repeats) demonstrate consistent impro
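The local-versus-global point can be made concrete with a small sketch. It assumes per-token confidence scores are available as a list of floats. It reads LGC as "lowest group confidence" (the minimum sliding-window mean) and Tail Confidence as the mean over the last few tokens. These readings, the window sizes, and the function names are assumptions for illustration, since the review excerpt does not define them.

```python
def sliding_window_means(values, window):
    """Mean of every contiguous window of the given size."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not values:
        raise ValueError("need at least one value")
    if len(values) <= window:
        return [sum(values) / len(values)]
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        means.append(total / window)
    return means


def global_mean_confidence(values):
    """Average confidence over the whole trace."""
    return sum(values) / len(values)


def lowest_group_confidence(values, window):
    """Assumed reading of LGC: the weakest sliding window in the trace."""
    return min(sliding_window_means(values, window))


def tail_confidence(values, tail):
    """Assumed reading of Tail Confidence: mean over the last `tail` tokens."""
    last = values[-tail:]
    return sum(last) / len(last)


# Toy trace (made up): confident throughout, except a short dip mid-trace.
trace = [10.0] * 8 + [1.0] * 4 + [10.0] * 8
print(global_mean_confidence(trace))       # about 8.2, the dip is averaged away
print(lowest_group_confidence(trace, 4))   # 1.0, the dip is exposed
print(tail_confidence(trace, 4))           # 10.0, the ending looks fine
```

The global mean stays high because sixteen confident tokens outweigh four weak ones, which is the masking effect the reviewer describes. The windowed minimum isolates the weak span.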
1. Confidence calibration: The authors treat percentile thresholds ($\eta = 10 / 90$) as hyperparameters, but the actual confidence scores are not calibrated across models. Reporting AUROC or ECE/Brier scores for trace-level discrimination would provide a more rigorous evaluation of confidence quality.
2. “Confidently wrong” cases: The authors briefly mention (Appendix D) that models sometimes assign high confidence to incorrect reasoning modes, but there’s no robust mitigation strategy. A hybri
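As an illustration of the trace-level discrimination check the reviewer asks for, AUROC can be computed directly from its pairwise definition: the probability that a randomly chosen correct trace has higher confidence than a randomly chosen incorrect one, with ties counted as half. The sketch below is a generic implementation on made-up numbers, not an evaluation from the paper, and the function name is hypothetical.

```python
def trace_auroc(confidences, is_correct):
    """AUROC of confidence as a predictor of trace correctness.

    Uses the pairwise (Mann-Whitney) definition, which is quadratic in the
    number of traces but easy to verify by hand.
    """
    correct = [c for c, ok in zip(confidences, is_correct) if ok]
    wrong = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not correct or not wrong:
        raise ValueError("need at least one correct and one incorrect trace")
    wins = 0.0
    for p in correct:
        for n in wrong:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(correct) * len(wrong))


# Toy example (made up): one correct trace has low confidence.
confidences = [0.9, 0.8, 0.4, 0.7, 0.3]
is_correct = [True, True, False, False, True]
print(trace_auroc(confidences, is_correct))  # 4 of 6 pairs ordered correctly
```

A value of 0.5 means confidence carries no ranking information, and 1.0 means every correct trace outranks every incorrect one. Because AUROC depends only on ordering, it is unaffected by the cross-model scale differences the reviewer raises; ECE and Brier scores would additionally require confidences mapped to probabilities.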
1. Clear Motivation and Problem Framing: The paper provides a clear and intuitive justification for moving from global to local confidence metrics.
2. The experimental setup is comprehensive: The authors test across multiple model families and scales, on a suite of reasoning and QA benchmarks. The paper also includes detailed ablations on hyperparameters introduced in the method.
3. The proposed method is practical and simple with multiple variants that can adapt to different downstream needs, m
1. Incremental Novelty: The primary novelty lies in the specific local metrics and the online early-stopping mechanism compared to prior works (e.g., Kang et al., 2025). While effective, the proposed method is an incremental improvement on existing ideas.
2. Overstated “No Hyperparameter Tuning.” The abstract’s “no … hyperparameter tuning” assertion is not faithful to the method as presented. The method introduces several new parameters that require choices: the filtering ratio, the online conse
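The online early-stopping mechanism mentioned in this review can be sketched as a streaming check: abandon a trace once the mean confidence of its most recent tokens falls below a threshold. The code below is one assumed reading of that idea. The `window` and `threshold` parameters and the function name are hypothetical, and how the threshold is chosen is left open, which is the kind of choice the reviewer notes must be made.

```python
from collections import deque


def stop_index(token_confidences, window, threshold):
    """Return how many tokens are consumed before stopping, or None.

    Tokens are processed one at a time, as they would arrive during
    generation. The trace is stopped as soon as a full window of recent
    tokens has a mean confidence below `threshold`.
    """
    recent = deque(maxlen=window)
    total = 0.0
    for i, confidence in enumerate(token_confidences):
        if len(recent) == window:
            total -= recent[0]      # value about to be evicted by append
        recent.append(confidence)
        total += confidence
        if len(recent) == window and total / window < threshold:
            return i + 1
    return None


# Toy stream (made up): confidence collapses after six tokens.
stream = [10.0] * 6 + [1.0] * 4 + [10.0] * 6
print(stop_index(stream, window=4, threshold=5.0))       # 9
print(stop_index([10.0] * 16, window=4, threshold=5.0))  # None
```

In the toy stream the trace is cut after 9 of 16 tokens, so the remaining 7 are never generated. A larger window reacts more slowly but is less sensitive to a single low-confidence token.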
1. **Simple and effective with practical value.** Relying only on *Bottom 10% Group Confidence* enables effective adaptive sampling with early stopping and confidence filtering of high-quality reasoning paths. In both offline and online settings, it markedly lowers inference cost while delivering better performance, indicating broad applicability.
2. **Comprehensive and rigorous experiments.** The paper evaluates five mainstream open-source reasoning LLMs from three model families on five chall
1. **Limited task generalization.** The evaluation focuses primarily on mathematical reasoning, with little assessment in verifiable settings such as code generation. Adding at least one non-mathematical reasoning benchmark would strengthen claims of generality.
2. **Missing highly relevant strong baselines.** While the paper centers on confidence-based methods and compares unweighted vs. confidence-weighted majority voting, it lacks comparisons with closely related, strong baselines, e.g., ESC
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
