Do Androids Know They're Only Dreaming of Electric Sheep?

Sky CH-Wang; Benjamin Van Durme; Jason Eisner; Chris Kedzie

arXiv:2312.17249 · cs.CL · June 11, 2024 · 2 cites

TL;DR

This paper trains probes on a transformer language model's internal representations to detect its hallucinations. On in-domain data the probes are accurate even at early layers, outperforming several contemporary baselines and an expert human annotator in response-level detection, but they generalize poorly across tasks and from synthetic to organic hallucinations.

Contribution

The study demonstrates that probes trained on internal transformer representations can reliably detect in-domain hallucinations across three grounded generation tasks, providing a practical alternative to existing hallucination evaluation baselines.

Findings

01

On in-domain data, probes reach 95% of their peak performance as early as layer 4.

02

Probing outperforms several contemporary baselines and surpasses an expert human annotator in response-level detection F1.

03

Probes are sensitive to their training domain: they generalize poorly from one task to another and from synthetic to organic hallucinations.

Abstract

We design probes trained on the internal representations of a transformer language model to predict its hallucinatory behavior on three grounded generation tasks. To train the probes, we annotate for span-level hallucination on both sampled (organic) and manually edited (synthetic) reference outputs. Our probes are narrowly trained and we find that they are sensitive to their training domain: they generalize poorly from one task to another or from synthetic to organic hallucinations. However, on in-domain data, they can reliably detect hallucinations at many transformer layers, achieving 95% of their peak performance as early as layer 4. Here, probing proves accurate for evaluating hallucination, outperforming several contemporary baselines and even surpassing an expert human annotator in response-level detection F1. Similarly, on span-level labeling, probes are on par or better than…
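
The probing setup described in the abstract can be illustrated with a minimal sketch. It assumes a linear (logistic-regression) probe over per-token hidden states, which the abstract does not specify, and it uses synthetic vectors as stand-ins for real model activations. All function names, layer counts, and signal strengths below are hypothetical and chosen only for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(X, y, lr=0.1, epochs=30):
    """Fit a logistic-regression probe with plain SGD; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, X):
    """Label a token 1 (hallucinated) when the probe's logit is positive."""
    return [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in X]

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def earliest_layer_at_fraction(scores, fraction=0.95):
    """Index of the first layer whose score is at least fraction * peak."""
    peak = max(scores)
    return next(i for i, s in enumerate(scores) if s >= fraction * peak)

def synthetic_hidden_states(n_tokens, n_layers, dim, rng):
    """Assumed toy data: the label signal grows with depth, then saturates."""
    y = [rng.randint(0, 1) for _ in range(n_tokens)]
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]
    layers = []
    for layer in range(n_layers):
        strength = 3.0 * min(1.0, layer / 3)  # no signal at layer 0
        layers.append([
            [rng.gauss(0, 1) + strength * (2 * label - 1) * d for d in direction]
            for label in y
        ])
    return layers, y

rng = random.Random(0)
layers, y = synthetic_hidden_states(n_tokens=300, n_layers=6, dim=8, rng=rng)
split = 200  # first 200 tokens train the probe, the rest evaluate it

scores = []
for X in layers:
    w, b = train_probe(X[:split], y[:split])
    scores.append(f1_score(y[split:], predict(w, b, X[split:])))

print("per-layer F1:", [round(s, 3) for s in scores])
print("earliest layer at 95% of peak:", earliest_layer_at_fraction(scores))
```

One probe is trained per layer, so the per-layer scores can be compared directly. The "95% of peak performance" criterion is then a lookup over those scores, not an accuracy figure.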

Taxonomy

Topics: Functional Brain Connectivity Studies · Digital Mental Health Interventions · Mental Health Research Topics