Know When You're Wrong: Aligning Confidence with Correctness for LLM Error Detection
Xie Xiaohu, Liu Xiaohu, Yao Benjamin

TL;DR
This paper introduces a normalized confidence score for LLMs to reliably detect errors and hallucinations, improving trustworthiness and enabling efficient retrieval-augmented generation.
Contribution
It proposes a confidence scoring method, analyzes calibration effects of training techniques, and demonstrates practical error detection and correction in LLMs.
Findings
Supervised fine-tuning improves confidence calibration.
RL methods like PPO and DPO cause overconfidence.
Adaptive retrieval with confidence scores enhances accuracy with fewer retrievals.
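The adaptive retrieval finding can be illustrated with a minimal sketch: answer without retrieval first, and retrieve only when the confidence score falls below a threshold. The function name, the `generate` and `retrieve` callables, and the 0.8 threshold are all hypothetical placeholders, not the paper's implementation.

```python
def answer_with_adaptive_retrieval(question, generate, retrieve, threshold=0.8):
    """Retrieve only when the model's own confidence is low.

    `generate(question, context)` is assumed to return (answer, confidence);
    `retrieve(question)` is assumed to return supporting documents.
    Both callables and the threshold are illustrative assumptions.
    """
    # First attempt: no retrieved context.
    answer, confidence = generate(question, None)
    if confidence >= threshold:
        return answer, confidence, False  # retrieval skipped

    # Low confidence: fetch context and try again.
    documents = retrieve(question)
    answer, confidence = generate(question, documents)
    return answer, confidence, True  # retrieval used
```

Under these assumptions, questions the model already answers confidently skip the retrieval call entirely, which is how fewer retrievals can coexist with higher accuracy.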
Abstract
As large language models (LLMs) are increasingly deployed in critical decision-making systems, the lack of reliable methods to measure their uncertainty presents a fundamental trustworthiness risk. We introduce a normalized confidence score based on output anchor token probabilities: classification labels for structured tasks and self-evaluation responses (Yes/No) for open-ended generation. This enables direct detection of errors and hallucinations with minimal overhead and without external validation. We make three key contributions. First, we propose a normalized confidence score and self-evaluation framework that exposes reliable confidence estimates for error detection across seven diverse benchmark tasks and five LLMs of varying architectures and sizes. Second, our theoretical analysis reveals that supervised fine-tuning (SFT) yields well-calibrated confidence through…
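As a rough illustration of a normalized confidence score over anchor tokens, the sketch below renormalizes the model's log-probabilities over the anchor set only (for example Yes/No, or the classification labels) and reports the top anchor's share. The exact normalization used in the paper is not given in this excerpt, so the formula, the function name, and the input format are assumptions.

```python
import math


def normalized_confidence(anchor_logprobs):
    """Return (top_anchor, confidence) from anchor-token log-probabilities.

    `anchor_logprobs` maps each anchor token (e.g. "Yes", "No") to the
    log-probability the model assigned it. Confidence here is the top
    anchor's probability divided by the total probability mass on all
    anchors; this normalization is an assumption, not the paper's formula.
    """
    # Subtract the max log-probability before exponentiating for stability.
    peak = max(anchor_logprobs.values())
    weights = {tok: math.exp(lp - peak) for tok, lp in anchor_logprobs.items()}
    total = sum(weights.values())

    top_anchor = max(weights, key=weights.get)
    return top_anchor, weights[top_anchor] / total
```

For instance, if the model puts probability 0.6 on "Yes" and 0.2 on "No", with the remaining mass on non-anchor tokens, this sketch gives a confidence of 0.75 for "Yes".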
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
