Weakly Supervised Detection of Hallucinations in LLM Activations

Miriam Rateike; Celia Cintas; John Wamburu; Tanya Akumu; Skyler Speakman

arXiv:2312.02798 · cs.LG · December 6, 2023 · 2 cites

TL;DR

This paper presents a weakly supervised auditing method to detect hallucination patterns in large language models' internal activations, enabling identification of responsible nodes without prior pattern knowledge.

Contribution

It introduces a novel subset scanning approach for anomaly detection in LLM activations that does not require pre-labeled pattern types or extensive supervision.

Findings

01 OPT encodes hallucination information internally.

02 BERT has limited capacity for encoding hallucinations.

03 Without prior knowledge of the pattern type, the method performs comparably to supervised classifiers.

Abstract

We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models. Importantly, our method does not need knowledge of the type of patterns a priori. Instead, it relies on a reference dataset devoid of anomalies during testing. Further, our approach enables the identification of pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods to handle LLM activations for anomalous sentences that may deviate from the expected distribution in either direction. Our results confirm prior findings of…
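The abstract names subset scanning over activations but does not spell out the procedure. The following is a minimal sketch of the general idea, under stated assumptions: activations of an audited input are converted to empirical p-values against activations from an anomaly-free reference set, and a simplified Berk-Jones-style score is maximised over subsets of nodes. The function names, the two-sided p-value combination, the `alpha_max` cap, and the synthetic data are all illustrative assumptions; this is not the paper's two scanning methods, nor the code in the linked repository.

```python
import numpy as np


def empirical_pvalues(reference, test):
    """Two-sided empirical p-values for one activation vector.

    reference: (n_ref, n_nodes) activations from anomaly-free inputs.
    test:      (n_nodes,) activations of the input being audited.
    """
    n_ref = reference.shape[0]
    # Share of reference activations at least as large / as small as the test value.
    right = (np.sum(reference >= test, axis=0) + 1) / (n_ref + 1)
    left = (np.sum(reference <= test, axis=0) + 1) / (n_ref + 1)
    # Assumption: a simple two-sided combination, so shifts in either direction count.
    return np.minimum(1.0, 2.0 * np.minimum(left, right))


def scan_nodes(pvalues, alpha_max=0.05):
    """Return (score, node indices) of the highest-scoring subset of nodes.

    With one p-value per node, the best subset for a threshold alpha is every
    node with p <= alpha, so only the sorted p-values need to be tried.
    """
    order = np.argsort(pvalues)
    best_score, best_k = 0.0, 0
    for k, alpha in enumerate(pvalues[order], start=1):
        if alpha > alpha_max:
            break
        # k nodes, all significant at level alpha: k * KL(1 || alpha) = k * log(1 / alpha).
        score = k * np.log(1.0 / alpha)
        if score > best_score:
            best_score, best_k = score, k
    return best_score, order[:best_k]


# Synthetic stand-in for real LLM activations (hypothetical data).
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 50))   # 500 clean inputs, 50 nodes
clean = rng.normal(size=50)
anomalous = clean.copy()
anomalous[:5] = 8.0                      # nodes shifted upwards
anomalous[5:10] = -8.0                   # nodes shifted downwards

p_clean = empirical_pvalues(reference, clean)
p_anom = empirical_pvalues(reference, anomalous)
clean_score, _ = scan_nodes(p_clean)
anomalous_score, flagged_nodes = scan_nodes(p_anom)
```

In this toy setup the returned node indices play the role of the "pivotal nodes" mentioned in the abstract, and no labels for the anomalous pattern are used: only the clean reference activations are needed.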

Code & Models

Repositories

Trusted-AI/adversarial-robustness-toolbox (PyTorch, official)

Taxonomy

Topics: Text Readability and Simplification · Ferroelectric and Negative Capacitance Devices · Topic Modeling

Methods: OPT