InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation

Likun Tan; Kuan-Wei Huang; Joy Shi; Kevin Wu

arXiv:2510.21538·cs.CL·October 27, 2025

TL;DR

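This paper introduces InterpDetect, a mechanistic approach that detects hallucinations in Retrieval-Augmented Generation (RAG) from interpretable internal signals, namely external context scores and parametric knowledge scores. The authors report that it outperforms existing detection baselines and generalizes across models.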

Contribution

The paper traces RAG hallucinations to later-layer FFN modules that disproportionately inject parametric knowledge into the residual stream, and builds a mechanistic detection method on external context scores and parametric knowledge scores.

Findings

01  Mechanistic signals effectively predict hallucinations.

02  Classifiers trained on Qwen3-0.6b generalize to GPT-4.1-mini.

03  Proposed method outperforms state-of-the-art baselines.

Abstract

Retrieval-Augmented Generation (RAG) integrates external knowledge to mitigate hallucinations, yet models often generate outputs inconsistent with retrieved content. Accurate hallucination detection requires disentangling the contributions of external context and parametric knowledge, which prior methods typically conflate. We investigate the mechanisms underlying RAG hallucinations and find they arise when later-layer FFN modules disproportionately inject parametric knowledge into the residual stream. To address this, we explore a mechanistic detection approach based on external context scores and parametric knowledge scores. Using Qwen3-0.6b, we compute these scores across layers and attention heads and train regression-based classifiers to predict hallucinations. Our method is evaluated against state-of-the-art LLMs (GPT-5, GPT-4.1) and detection baselines (RAGAS, TruLens,…
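
The detection step described in the abstract (scores computed across layers and attention heads, then fed to a regression-based classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The scores here are synthetic random numbers rather than values extracted from Qwen3-0.6b, and the feature layout (one external context score per layer-head pair, one parametric knowledge score per layer), the function names, the NumPy logistic regression, and all hyperparameters are assumptions made for the example. The synthetic data merely encodes the abstract's qualitative claim that hallucinated outputs show more parametric knowledge in later layers.

```python
import numpy as np


def make_synthetic_scores(n_samples=400, n_layers=4, n_heads=2, seed=0):
    """Build placeholder features; real scores would come from model internals."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)  # 1 = hallucinated (assumed coding)

    # Assumed layout: one external context score per (layer, head) pair.
    # Assumption for the toy data: hallucinated samples use context less.
    ecs = rng.normal(0.0, 1.0, size=(n_samples, n_layers * n_heads))
    ecs -= 1.0 * labels[:, None]

    # Assumed layout: one parametric knowledge score per layer.
    # Later layers are shifted upward for hallucinated samples.
    pks = rng.normal(0.0, 1.0, size=(n_samples, n_layers))
    pks[:, n_layers // 2:] += 1.0 * labels[:, None]

    return np.hstack([ecs, pks]), labels


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_regression(X, y, lr=0.1, n_steps=500, l2=1e-3):
    """Plain batch gradient descent on the L2-regularized logistic loss."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(n_steps):
        probs = sigmoid(X @ weights + bias)
        weights -= lr * (X.T @ (probs - y) / n_samples + l2 * weights)
        bias -= lr * np.mean(probs - y)
    return weights, bias


features, labels = make_synthetic_scores()

# Simple holdout split: first 300 samples for training, last 100 for testing.
X_train, y_train = features[:300], labels[:300]
X_test, y_test = features[300:], labels[300:]

weights, bias = fit_logistic_regression(X_train, y_train)
predictions = (sigmoid(X_test @ weights + bias) >= 0.5).astype(int)
accuracy = float(np.mean(predictions == y_test))
print(f"holdout accuracy on synthetic scores: {accuracy:.2f}")
```

A linear model keeps the result interpretable: each learned weight maps to one layer or head, so large weights point to where the signal for hallucination sits. The accuracy printed here reflects only how the toy data was generated and says nothing about the paper's reported results.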
