InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation
Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu

TL;DR
This paper introduces InterpDetect, a mechanistic hallucination detector for Retrieval-Augmented Generation (RAG). It scores how much a response draws on retrieved context versus the model's parametric knowledge, and the paper reports that it outperforms existing detection baselines and generalizes across models.
Contribution
It traces RAG hallucinations to later-layer FFN modules that disproportionately inject parametric knowledge into the residual stream, and it develops a mechanistic detection method based on external context scores and parametric knowledge scores.
Findings
Mechanistic signals effectively predict hallucinations.
Classifiers trained on Qwen3-0.6b generalize to GPT-4.1-mini.
Proposed method outperforms state-of-the-art baselines.
Abstract
Retrieval-Augmented Generation (RAG) integrates external knowledge to mitigate hallucinations, yet models often generate outputs inconsistent with retrieved content. Accurate hallucination detection requires disentangling the contributions of external context and parametric knowledge, which prior methods typically conflate. We investigate the mechanisms underlying RAG hallucinations and find they arise when later-layer FFN modules disproportionately inject parametric knowledge into the residual stream. To address this, we explore a mechanistic detection approach based on external context scores and parametric knowledge scores. Using Qwen3-0.6b, we compute these scores across layers and attention heads and train regression-based classifiers to predict hallucinations. Our method is evaluated against state-of-the-art LLMs (GPT-5, GPT-4.1) and detection baselines (RAGAS, TruLens,…
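The detection step described in the abstract (scores computed per attention head and per layer, then fed to a regression-based classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the feature values are synthetic rather than extracted from Qwen3-0.6b, the counts of heads and layers are arbitrary, the helper names are hypothetical, and plain logistic regression stands in for whichever regression-based classifiers the authors actually trained.

```python
import numpy as np


def make_synthetic_scores(n_spans=600, n_heads=8, n_layers=4, seed=0):
    """Build toy features: one external context score (ECS) per attention head
    and one parametric knowledge score (PKS) per layer, for each response span.

    Assumption for the toy data only: hallucinated spans (label 1) tend to have
    lower ECS and higher PKS than faithful spans (label 0).
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_spans)
    ecs = rng.normal(size=(n_spans, n_heads)) - 0.8 * labels[:, None]
    pks = rng.normal(size=(n_spans, n_layers)) + 0.8 * labels[:, None]
    return np.hstack([ecs, pks]), labels


def predict_proba(X, w, b):
    """Probability that each span is hallucinated under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def fit_logistic_regression(X, y, lr=0.1, n_steps=500):
    """Fit logistic regression with batch gradient descent on the log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_steps):
        residual = predict_proba(X, w, b) - y
        w -= lr * (X.T @ residual) / n_samples
        b -= lr * residual.mean()
    return w, b


N_HEADS = 8  # hypothetical number of ECS features
X, y = make_synthetic_scores(n_heads=N_HEADS)
X_train, y_train = X[:450], y[:450]
X_test, y_test = X[450:], y[450:]

w, b = fit_logistic_regression(X_train, y_train)
accuracy = np.mean((predict_proba(X_test, w, b) >= 0.5) == y_test)

print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
print(f"mean ECS weight: {w[:N_HEADS].mean():+.3f}")
print(f"mean PKS weight: {w[N_HEADS:].mean():+.3f}")
```

In this toy setup the learned weights are interpretable in the sense the summary describes: negative weights on ECS features and positive weights on PKS features indicate that weak reliance on retrieved context and strong reliance on parametric knowledge both push a span toward the hallucinated class. The accuracy printed here reflects only the synthetic data and says nothing about the paper's reported results.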
