The First Token Knows: Single-Decode Confidence for Hallucination Detection
Mina Gabriel

TL;DR
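The paper introduces phi_first, a low-cost confidence measure computed from the top-K logits at the first content-bearing answer token of a single greedy decode. On closed-book short-answer factual question answering, it detects hallucinations as well as or modestly better than sampling-based self-consistency methods.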
Contribution
It demonstrates that first-token confidence from a single greedy decode can serve as an efficient alternative to semantic self-consistency for hallucination detection.
Findings
phi_first achieves a mean AUROC of 0.820 across three 7-8B instruction-tuned models and two benchmarks.
It matches or modestly exceeds semantic self-consistency (mean AUROC 0.793) and standard surface-form self-consistency (0.791).
Combining phi_first with semantic agreement yields minimal additional benefit.
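The AUROC figures above summarize how well a confidence score separates correct answers from hallucinated ones. As a rough illustration only, the sketch below computes AUROC as the fraction of (correct, incorrect) answer pairs in which the correct answer receives the higher score, counting ties as half. The function name `auroc`, the convention that label 1 means a correct answer, and the toy scores are assumptions made for this example and are not taken from the paper.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score of a correct answer > score of an incorrect one).

    scores: confidence values, higher meaning more confident (assumed convention).
    labels: 1 for a correct answer, 0 for a hallucinated one (assumed convention).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data for illustration only, not results from the paper.
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

An AUROC of 0.5 corresponds to a score that ranks correct and hallucinated answers no better than chance, and 1.0 to perfect separation.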
Abstract
Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decoding and can be sensitive to lexical variation. Semantic self-consistency improves this by clustering sampled answers by meaning using natural language inference, but it adds both sampling cost and external inference overhead. We show that first-token confidence, phi_first, computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode, matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering. Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. A subsumption test shows that…
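The abstract describes phi_first as coming from the normalized entropy of the top-K logits at the first content-bearing answer token. A minimal sketch of one plausible reading follows. It assumes the top-K logits are renormalized with a softmax, the entropy is divided by log K, and confidence is reported as one minus that value. The function name, the default K, and the "one minus" convention are assumptions for illustration, and locating the first content-bearing token is left out entirely.

```python
import math


def phi_first(logits, k=10):
    """Confidence from the normalized entropy of the top-k logits.

    Assumptions (not confirmed by the source): softmax over the top-k logits
    only, entropy normalized by log(k), confidence = 1 - normalized entropy.
    `logits` is the logit vector at the first content-bearing answer token.
    """
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        raise ValueError("need at least two logits to normalize the entropy")
    # Numerically stable softmax restricted to the top-k logits.
    m = top[0]
    exps = [math.exp(x - m) for x in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Dividing by log(len(top)) maps the entropy into [0, 1].
    return 1.0 - entropy / math.log(len(top))


# A flat distribution gives confidence near 0; a sharp peak gives near 1.
flat = phi_first([0.0, 0.0, 0.0, 0.0], k=4)
peaked = phi_first([50.0, 0.0, 0.0, 0.0], k=4)
```

Under this reading the score needs only the logits from one forward step of a greedy decode, which is where the cost saving over sampling-based agreement comes from.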
