
TL;DR
This paper introduces a geometric taxonomy for detecting hallucinations in language models using embedding space analysis, providing interpretable methods suited for black-box deployment scenarios.
Contribution
It develops a geometry-based framework predicting which hallucination types are detectable, validated through a new human-confabulated dataset across multiple domains.
Findings
Query-proximate unfaithfulness detectable by angular ratio
Directional signatures outperform NLI in confabulation detection
Some factual errors are indistinguishable by angular geometry
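The angular quantities named in these findings can be illustrated with a minimal sketch. This summary does not give the paper's exact definition of the angular ratio, so the `angular_ratio` function below is a hypothetical stand-in chosen for illustration. It compares the response's angle to the source with its angle to the query. The toy 2-D vectors are likewise assumptions that stand in for real sentence embeddings.

```python
import math


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


def angular_ratio(query, response, source):
    """Hypothetical ratio (not the paper's definition):
    angle(response, source) / angle(response, query).
    Undefined when the response points exactly along the query."""
    return angle(response, source) / angle(response, query)


# Toy embeddings (assumed for illustration only).
query = [1.0, 0.0]
source = [0.0, 1.0]
grounded = [0.2, 1.0]          # response lying close to the source
query_proximate = [1.0, 0.2]   # response lying close to the query

ratio_grounded = angular_ratio(query, grounded, source)
ratio_proximate = angular_ratio(query, query_proximate, source)
print(ratio_grounded < 1.0 < ratio_proximate)
```

Under this toy definition, a response that stays near the query and far from the source yields a ratio above 1, while a response near the source yields a ratio below 1. Any threshold used in practice would need calibration on real embeddings.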
Abstract
Hallucinations in deployed language models can have real consequences for downstream decisions in domains such as healthcare, legal, and financial services. In production, detection has to run on what the deployed system can see: the query, the response, and often a source document. White-box access to model internals and multi-sample querying are not generally available behind a third-party API. Within this setting (black-box, single-pass, with only the question and answer available), the dominant baseline is natural language inference (NLI), which returns a value but no diagnosis when it fails. We argue that operating directly on the geometry of the embedding space provides detection methods whose successes and failures are interpretable as structural properties of contrastive sentence-encoder training \citep{wang2020understanding}. The contribution is: given an operationally motivated taxonomy, geometry predicts which types…
