Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision

Chaeyun Jang; Moonseok Choi; Yegon Kim; Hyungi Lee; Juho Lee

arXiv:2506.03723·cs.CL·June 5, 2025

Open Access · 4 Reviews

TL;DR

This paper demonstrates that fine-tuning large language models with scalar confidence labels alone can induce self-verification behavior, improving calibration, accuracy, and interpretability without explicit reasoning supervision.

Contribution

It reveals that scalar confidence supervision alone can elicit self-verification in LLMs, a behavior previously thought to require explicit training signals.

Findings

01 · Confidence-aware fine-tuning improves calibration and accuracy.

02 · Models generate longer, self-checking responses for low-confidence queries.

03 · Test-time scaling based on calibrated uncertainty boosts performance.
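The third finding can be illustrated with a confidence-gated retry loop. This is a minimal sketch under stated assumptions, not the authors' procedure, which this page does not describe: `generate` is a hypothetical callable that returns an answer together with a verbalized confidence in [0, 1], and `threshold` and `max_rounds` are illustrative parameters.

```python
from typing import Callable, Tuple

# Hypothetical interface: a model call returning (answer, confidence in [0, 1]).
Generator = Callable[[str], Tuple[str, float]]


def rethink(generate: Generator, prompt: str,
            threshold: float = 0.5, max_rounds: int = 3) -> Tuple[str, float]:
    """Re-query while the verbalized confidence stays below `threshold`.

    Keeps the highest-confidence answer seen so far. The threshold and the
    round budget are assumptions for illustration, not values from the paper.
    """
    best_answer, best_conf = generate(prompt)
    rounds = 0
    while best_conf < threshold and rounds < max_rounds:
        answer, conf = generate(prompt)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        rounds += 1
    return best_answer, best_conf
```

Extra compute is spent only on queries the model itself flags as uncertain, which is only useful if the verbalized confidence is reasonably well calibrated.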

Abstract

Uncertainty calibration is essential for the safe deployment of large language models (LLMs), particularly when users rely on verbalized confidence estimates. While prior work has focused on classifiers or short-form generation, confidence calibration for chain-of-thought (CoT) reasoning remains largely unexplored. Surprisingly, we find that supervised fine-tuning with scalar confidence labels alone suffices to elicit self-verification behavior of language models, without any explicit reasoning supervision or reinforcement learning-based rewards. Despite being trained only to produce a verbalized confidence score without any self-verifying examples, the model learns to generate longer and self-checking responses for low-confidence queries while providing more concise answers for high-confidence ones. We further propose a simple rethinking method that boosts performance via test-time…
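One common way to quantify calibration is the expected calibration error (ECE), which compares average confidence with empirical accuracy inside confidence bins. The sketch below shows the standard binned definition; this page does not say which calibration metric the paper reports, so the metric choice, the function name, and the bin count are assumptions.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equally sized, non-empty inputs")

    # Assign each prediction to an equal-width bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that says 50% and is right half the time scores 0, while a model that says 100% and is right half the time scores 0.5.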

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The paper is well-written and easy to follow and understand.
- Some of the experiments are novel and interesting, such as the cross-domain generalization, which suggests that the learned verbalized confidence transfers beyond the original reasoning tasks.

Weaknesses

The proposed method was already introduced by Lin, Hilton, and Evans (2022) in “Teaching Models to Express Their Uncertainty in Words”. That paper presented the exact same method of training a model to output natural-language confidence statements using the model's own empirical accuracy as labels. The only difference here is that the authors round the model’s empirical confidence to the nearest 10% instead of using continuous estimates. Despite this near-identity, the authors do not cite or di
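The labeling scheme this review describes (the model's own empirical accuracy, rounded to the nearest 10%) can be sketched as follows. This is an illustration of the reviewer's description only: how many answers are sampled per question, how ties are rounded, and the target string format are all assumptions, and `format_target` is a hypothetical helper.

```python
import math
from typing import Sequence


def confidence_label(outcomes: Sequence[bool]) -> int:
    """Map per-question correctness of sampled answers to a label in {0, 10, ..., 100}.

    `outcomes` holds one boolean per sampled answer (True if correct).
    Ties are rounded half up, which is an assumption of this sketch.
    """
    if not outcomes:
        raise ValueError("need at least one sampled answer")
    accuracy = sum(1 for ok in outcomes if ok) / len(outcomes)
    return int(math.floor(accuracy * 10 + 0.5)) * 10


def format_target(answer: str, outcomes: Sequence[bool]) -> str:
    """Hypothetical fine-tuning target: the answer followed by a scalar confidence."""
    return f"{answer}\nConfidence: {confidence_label(outcomes)}%"
```

Each individual answer is only right or wrong, but averaging over several sampled answers to the same question yields a graded label, for example 3 correct out of 8 gives 40%.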

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. This paper offers a major conceptual insight: when trained to verbalize confidence, models can self-check low-confidence responses, which contributes to understanding the relationship between verbalized uncertainty and a model's reasoning ability.
2. This paper demonstrates strong empirical results, achieving consistent improvements on both reasoning tasks like MATH and GPQA and non-reasoning tasks like GSM8K, ARC, and HellaSwag.

Weaknesses

1. The idea of introducing confidence into LLM training is not new. There are several relevant studies [1-3], and the paper does not reference them.
2. The evaluation lacks deeper qualitative analysis. In particular, more real-world scenarios are not explored. Current reasoning models are quite strong on more complex tasks, even a 4B model like Qwen3-4B. More evaluation should be included on more complex tasks like AIME, CodeLiveBench.
3. Since only the confidenc

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The method provides confidence alongside the answers without altering the model too much, owing to KL-based normalization. The method improves on other tasks, i.e., it generalizes. The method seems to improve answer quality. I enjoyed reading the paper and it was informative to me; good method.

Weaknesses

I could not immediately figure out how they get a spectrum of confidence when the feedback is binary (correct vs. incorrect). Could you explain this more clearly, please? I wonder how much of the model's abilities we actually lose, which could be assessed; maybe I missed this point. The paper misses some literature on attribution to a source, such as retrieval in a loop, e.g., attributed question answering, which has similar aims and provides evidence via retrieval.

Reviewer 04 · Rating 2 · Confidence 4

Strengths

- **Simple, relevant and well-motivated**: The proposed method is simple yet tackles the important problem of model calibration. Calibration is increasingly critical for deploying reasoning models in real-world applications.
- **Interesting results**: Despite its simplicity, the method appears to produce notable and somewhat surprising empirical results (e.g., accuracy improvements), which merit deeper investigation.

Weaknesses

- **Novelty**: It is unclear what the novelty of this framework is. There have been previous works from as long as 3 years ago [1] which use a very similar setup.
- **Experimental Design and Strength**: No baselines (except a simple pre-trained model) are presented in the results, making it difficult to identify the strengths of the method relative to existing works in the field. Some results presented in this paper are also surprising, and merit deeper analysis. In particular, it is unclear to

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)