When Should a Language Model Trust Itself? Same-Model Self-Verification as a Conditional Confidence Signal
Aditya Ajay Phalod

TL;DR
This paper evaluates same-model self-verification as a confidence signal for language models and finds that its usefulness varies significantly across tasks, models, and baselines; it is not a universal uncertainty estimator.
Contribution
It provides an empirical comparison of self-verification against two likelihood baselines, LL-AVG and LL-SUM, on ARC-Challenge and TruthfulQA-MC across multiple model families, scales, and prompt variants, highlighting that its utility is conditional.
Findings
On ARC-Challenge, self-verification substantially improves over LL-AVG for Phi-2 and the Qwen models, with the largest gains for Qwen-7B.
On TruthfulQA-MC, self-verification is less reliable and often underperforms baselines.
The utility of self-verification depends on task, model, prompt, and baseline.
Abstract
Same-model self-verification, prompting a model to audit its own predicted answer, is a plausible confidence signal for selective prediction, but its practical value remains unclear once strong likelihood-based baselines are taken seriously. We evaluate self-verification against two such baselines, LL-AVG and LL-SUM, on ARC-Challenge and TruthfulQA-MC across multiple model families, scales, and prompt variants. We measure not only correctness ranking, but also abstention quality through AURC and operating-point analyses. The results are sharply task- and model-dependent. On ARC-Challenge, self-verification substantially improves over LL-AVG for Phi-2 and the Qwen models, with the largest gains appearing in Qwen-7B. On TruthfulQA-MC, however, the signal is less reliable: smaller models can become prompt-sensitive, DeepSeek-R1-Distill-8B degrades relative to LL-AVG, and LL-SUM often…
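As a rough illustration of the quantities named in the abstract, the sketch below computes two likelihood-based confidence scores and an area under the risk-coverage curve (AURC). It is not the paper's code. It assumes that LL-SUM is the sum of the answer's token log-probabilities and LL-AVG is their length-normalized mean, and that AURC is the mean selective risk over all coverage levels when examples are ranked by confidence; the listing does not spell out these definitions, and the function names `ll_sum`, `ll_avg`, and `aurc` are hypothetical.

```python
from typing import Sequence


def ll_sum(token_logprobs: Sequence[float]) -> float:
    """Assumed LL-SUM: total log-likelihood of the answer tokens."""
    return sum(token_logprobs)


def ll_avg(token_logprobs: Sequence[float]) -> float:
    """Assumed LL-AVG: log-likelihood averaged over answer tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


def aurc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Area under the risk-coverage curve (lower is better).

    Examples are sorted from most to least confident. At coverage k/n the
    risk is the error rate among the k most confident examples; the value
    returned is the mean of those risks over k = 1..n.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    errors = 0
    risks = []
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / k)
    return sum(risks) / len(risks)


if __name__ == "__main__":
    # Toy data, not results from the paper.
    print(ll_sum([-0.5, -1.5]), ll_avg([-0.5, -1.5]))
    print(aurc([0.9, 0.8, 0.7, 0.6], [True, True, False, False]))
```

Under these assumptions, any confidence signal (a likelihood score or a self-verification score) can be passed to `aurc` with the same correctness labels, so signals are compared on how well they rank correct answers above incorrect ones.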
