Do Androids Know They're Only Dreaming of Electric Sheep?
Sky CH-Wang, Benjamin Van Durme, Jason Eisner, Chris Kedzie

TL;DR
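This paper introduces probes trained on a transformer language model's internal representations to detect its hallucinations on grounded generation tasks. On in-domain data the probes reach 95% of their peak performance as early as layer 4 and outperform several contemporary baselines, as well as an expert human annotator in response-level detection F1, but they generalize poorly across tasks.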
Contribution
The study shows that probes trained on internal transformer representations can reliably detect hallucinations on in-domain data, making probing an accurate way to evaluate hallucination in that setting.
Findings
On in-domain data, probes reach 95% of their peak performance as early as layer 4.
Probing outperforms several contemporary baselines and surpasses an expert human annotator in response-level detection F1.
Probes are sensitive to their training domain: they generalize poorly from one task to another and from synthetic to organic hallucinations.
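To make the "95% of peak performance" criterion concrete, here is a minimal sketch of how such a layer could be identified from per-layer probe scores. The function name `earliest_layer_reaching` and the scores are hypothetical and made up for illustration; they are not results from the paper.

```python
def earliest_layer_reaching(scores, fraction=0.95):
    """Return the 0-based index of the first layer whose score is at
    least `fraction` of the best score across all layers."""
    if not scores:
        raise ValueError("scores must be non-empty")
    threshold = fraction * max(scores)
    for layer, score in enumerate(scores):
        if score >= threshold:
            return layer


# Made-up per-layer probe scores, for illustration only.
example_scores = [0.50, 0.60, 0.68, 0.70, 0.69]
print(earliest_layer_reaching(example_scores))  # prints 2
```

A statement of this kind describes how early the score curve saturates relative to its own peak, which is different from saying the probe has 95% accuracy.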
Abstract
We design probes trained on the internal representations of a transformer language model to predict its hallucinatory behavior on three grounded generation tasks. To train the probes, we annotate for span-level hallucination on both sampled (organic) and manually edited (synthetic) reference outputs. Our probes are narrowly trained and we find that they are sensitive to their training domain: they generalize poorly from one task to another or from synthetic to organic hallucinations. However, on in-domain data, they can reliably detect hallucinations at many transformer layers, achieving 95% of their peak performance as early as layer 4. Here, probing proves accurate for evaluating hallucination, outperforming several contemporary baselines and even surpassing an expert human annotator in response-level detection F1. Similarly, on span-level labeling, probes are on par or better than…
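As a rough illustration of the idea in the abstract, the sketch below trains a linear probe to label token vectors as hallucinated or not. Everything in it is an assumption made for the example: the probe is plain logistic regression, the "hidden states" are synthetic Gaussian vectors rather than activations from a real model, and all names (`train_linear_probe`, `make_toy_states`, and so on) are hypothetical. The excerpt above does not specify the authors' probe architecture or training setup.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit logistic regression (weights and bias) by batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    # Label 1 means "hallucinated" in this toy setup.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def f1_score(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def make_toy_states(rng, n, dim=8):
    """Synthetic stand-in for per-token hidden states: the label shifts
    the mean of the first coordinate; the other coordinates are noise."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 if y == 1 else -2.0
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_toy_states(rng, 200)
test_x, test_y = make_toy_states(rng, 100)
w, b = train_linear_probe(train_x, train_y)
f1 = f1_score(test_y, [predict(w, b, x) for x in test_x])
print(f"held-out token-level F1 on toy data: {f1:.2f}")
```

In a real setting, the feature vectors would be hidden states read from one chosen transformer layer, and one probe would be fit per layer to compare how detectable hallucination is at each depth.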
Taxonomy
Topics: Functional Brain Connectivity Studies · Digital Mental Health Interventions · Mental Health Research Topics
