The First Token Knows: Single-Decode Confidence for Hallucination Detection

Mina Gabriel

arXiv:2605.05166·cs.CL·May 7, 2026

TL;DR

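The paper introduces phi_first, a low-cost confidence measure computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode. On closed-book short-answer factual question answering, it detects hallucinations as well as or modestly better than costlier sampling-based self-consistency methods.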

Contribution

It demonstrates that first-token confidence from a single greedy decode can serve as an efficient alternative to semantic self-consistency for hallucination detection.

Findings

01

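phi_first achieves a mean AUROC of 0.820 across three 7-8B instruction-tuned models and two benchmarks, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency.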

02

It matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering.

03

Combining phi_first with semantic agreement yields minimal additional benefit.
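
The AUROC figures above can be read as the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen hallucinated one. The following is a minimal sketch of that pairwise definition, not the paper's evaluation code. The function name `auroc` and the label convention (1 = correct, 0 = hallucinated) are assumptions made for illustration.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (correct, hallucinated) pairs ranked correctly.

    Assumed convention (not from the paper): label 1 = correct answer,
    label 0 = hallucinated answer, higher score = more confident.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy confidence scores, invented purely for illustration.
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

Under this reading, a score of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so the reported differences between methods are differences in ranking quality rather than in any particular decision threshold.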

Abstract

Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decoding and can be sensitive to lexical variation. Semantic self-consistency improves this by clustering sampled answers by meaning using natural language inference, but it adds both sampling cost and external inference overhead. We show that first-token confidence, phi_first, computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode, matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering. Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. A subsumption test shows that…
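
To make the description of phi_first concrete, here is a minimal sketch of one way a confidence score could be derived from the normalized entropy of the top-K logits at a single token position. The abstract does not give the exact formula, so several details are assumptions: the score is taken as one minus the normalized entropy, the softmax is renormalized over the top-K logits only, and the default `k=10` is arbitrary. Locating the first content-bearing answer token is also left out; the function simply takes that position's logits as input.

```python
import math


def phi_first(logits, k=10):
    """Hypothetical first-token confidence from top-k logits.

    Assumed form (not confirmed by the source):
        phi = 1 - H(softmax(top-k logits)) / log(k)
    so that 0 means a uniform top-k distribution and values near 1
    mean almost all mass sits on a single token.
    """
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        raise ValueError("need at least two logits to normalize entropy")
    # Numerically stable softmax restricted to the top-k logits.
    m = top[0]
    exps = [math.exp(x - m) for x in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Shannon entropy in nats, skipping zero-probability terms.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Divide by the maximum possible entropy for this many candidates.
    return 1.0 - entropy / math.log(len(top))


# Toy logits, invented for illustration only.
flat = phi_first([0.0, 0.0, 0.0, 0.0], k=4)      # uniform over top-4
peaked = phi_first([50.0, 0.0, 0.0, 0.0], k=4)   # one dominant token
```

In practice the logits would come from the model's output at the first content-bearing position of the greedy decode, which is why this score needs only one forward generation rather than the repeated sampling that self-consistency requires.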
