A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations

Charles O'Neill; Slava Chalnev; Chi Chi Zhao; Max Kirkby; Mudith Jayasekara

arXiv:2507.23221·cs.LG·August 1, 2025

Open Access

TL;DR

This paper shows that a linear probe on the residual stream of an observer model can detect contextual hallucinations in generated text, and that manipulating the direction the probe finds steers how often a generator hallucinates. It offers a practical interpretability method and a new benchmark for evaluation.

Contribution

It presents a transferable linear probe for detecting hallucinations in language models and demonstrates causal manipulation of hallucination rates, advancing interpretability and mitigation techniques.
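To make the idea of a linear probe concrete, here is a minimal sketch in Python with NumPy. It is not the authors' code: the activations are synthetic stand-ins for an observer model's residual-stream vectors, and `fit_linear_probe`, `probe_scores`, `true_dir`, and all hyperparameters are hypothetical names and values chosen only for illustration.

```python
import numpy as np

def fit_linear_probe(acts, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe: p(hallucinated) = sigmoid(acts @ w + b)."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = probs - labels                    # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ err / n + l2 * w)   # gradient step with L2 regularisation
        b -= lr * err.mean()
    return w, b

def probe_scores(acts, w, b):
    """Probability that each activation vector comes from hallucinated text."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))

# Synthetic data (an assumption, not real model activations): the two classes
# differ only by a shift along one hidden unit-norm direction.
rng = np.random.default_rng(0)
d_model = 64
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=400).astype(float)   # 1 = hallucinated, 0 = faithful
acts = rng.normal(size=(400, d_model)) + 3.0 * np.outer(2 * labels - 1, true_dir)

# Train on the first 300 examples, evaluate on the remaining 100.
w, b = fit_linear_probe(acts[:300], labels[:300])
preds = probe_scores(acts[300:], w, b) > 0.5
accuracy = (preds == labels[300:].astype(bool)).mean()
cosine = w @ true_dir / np.linalg.norm(w)   # alignment of learned and planted directions
```

On data built this way the learned weight vector `w` should align closely with the planted direction, which is the sense in which a probe "isolates a single direction". Real residual-stream activations would replace `acts`, and the choice of layer and token pooling would need to follow the paper.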

Findings

01. Linear residual probe outperforms baselines by 5-27 points.

02. Robust detection across models from 2B to 27B parameters.

03. Causal manipulation of hallucination rates is demonstrated.

Abstract

Contextual hallucinations -- statements unsupported by given context -- remain a significant challenge in AI. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, transferable linear direction separating hallucinated from faithful text, outperforming baselines by 5-27 points and showing robust mid-layer performance across Gemma-2 models (2B to 27B). Gradient-times-activation localises this signal to sparse, late-layer MLP activity. Critically, manipulating this direction causally steers generator hallucination rates, proving its actionability. Our results offer novel evidence of internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, exploitable for detection and mitigation. We release the 2000-example…
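The two mechanisms named in the abstract, steering along a direction and gradient-times-activation attribution, can be sketched on a toy linear model. This is an illustrative assumption rather than the paper's setup: `w`, `b`, and `W_out` are random stand-ins for a probe and one MLP output matrix, the residual is written by that single MLP only, and `steer` and `grad_times_activation` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_mlp = 16, 64

# Hypothetical stand-ins: a probe (w, b) and one MLP layer's output projection.
w = rng.normal(size=d_model)
b = 0.1
W_out = rng.normal(size=(d_model, d_mlp)) / np.sqrt(d_mlp)

def probe_logit(resid):
    """Linear probe read-out on a residual-stream vector."""
    return resid @ w + b

def steer(resid, direction, alpha):
    """Shift the residual by alpha along the unit-normalised direction."""
    unit = direction / np.linalg.norm(direction)
    return resid + alpha * unit

def grad_times_activation(mlp_act):
    """Per-unit attribution: activation times d(logit)/d(activation)."""
    grad = W_out.T @ w          # exact gradient, because the toy model is linear
    return mlp_act * grad

# Toy forward pass: ReLU-like MLP activations written into the residual stream.
mlp_act = np.maximum(rng.normal(size=d_mlp), 0.0)
resid = W_out @ mlp_act

# Steering against the probe direction lowers the probe logit by |alpha| * ||w||.
steered = steer(resid, w, alpha=-4.0)
logit_shift = probe_logit(steered) - probe_logit(resid)

# Attribution over MLP units; the largest magnitudes mark the most influential units.
attributions = grad_times_activation(mlp_act)
top_units = np.argsort(-np.abs(attributions))[:5]
```

In this linear toy the attributions sum exactly to the probe logit minus the bias, which is why a few large entries can be read as localising the signal. In a real transformer the gradient would come from automatic differentiation, and steering would be applied to the generator's activations during decoding.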

Taxonomy

Topics: Mental Health Research Topics · Functional Brain Connectivity Studies · Paranormal Experiences and Beliefs