Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders

Guangzhi Xiong; Zhenghao He; Bohan Liu; Sanchit Sinha; Aidong Zhang

arXiv:2512.08892·cs.CL·February 12, 2026

TL;DR

This paper introduces RAGLens, a lightweight hallucination detector for Retrieval-Augmented Generation that leverages sparse autoencoders to interpret internal LLM signals, significantly improving faithfulness detection and providing interpretable rationales.

Contribution

The paper presents RAGLens, a novel method using sparse autoencoders to detect RAG hallucinations based on internal representations, with superior accuracy and interpretability over existing methods.

Findings

01. RAGLens outperforms existing hallucination detection methods.

02. It provides interpretable rationales for detection decisions.

03. The approach reveals insights into hallucination signals within LLMs.

Abstract

Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of…
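The abstract is cut off before the pipeline is described, so the following is only a minimal sketch of the general idea, pieced together from terms that appear on this page (SAE features, max pooling over tokens, and a generalized additive model, GAM). Every name, shape, weight, and shape function below is a hypothetical placeholder rather than the authors' implementation: the encoder weights are random instead of trained, and the piecewise-linear shape functions are fixed by hand instead of fitted to labeled data.

```python
import numpy as np

def sae_encode(hidden, W_enc, b_enc):
    """ReLU SAE encoder: (tokens, d_model) -> (tokens, n_features)."""
    return np.maximum(hidden @ W_enc + b_enc, 0.0)

def max_pool(acts):
    """Max over the token axis, giving one value per SAE feature."""
    return acts.max(axis=0)

def additive_score(pooled, knots, values, bias=0.0):
    """Additive model: bias + sum_j f_j(pooled[j]), each f_j piecewise linear."""
    total = bias
    for j, z in enumerate(pooled):
        # np.interp evaluates the j-th shape function at z (clamped at the ends).
        total += np.interp(z, knots[j], values[j])
    return float(total)

# Placeholder inputs: random stand-ins for LLM hidden states and SAE weights.
rng = np.random.default_rng(0)
n_tokens, d_model, n_features = 5, 8, 4
hidden = rng.normal(size=(n_tokens, d_model))
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)

acts = sae_encode(hidden, W_enc, b_enc)   # token-level feature activations
pooled = max_pool(acts)                   # response-level feature vector

# Hand-picked shape functions (assumption): same knots and values per feature.
knots = np.tile(np.array([0.0, 1.0, 5.0]), (n_features, 1))
values = np.tile(np.array([0.0, 0.5, 1.0]), (n_features, 1))

score = additive_score(pooled, knots, values)
probability = 1.0 / (1.0 + np.exp(-score))  # logistic link to a [0, 1] score
print(f"hallucination score (illustrative only): {probability:.3f}")
```

Because the score is a sum of one-dimensional functions, each feature's contribution can be read off on its own, which is one plausible reason an additive model suits an interpretability-focused detector.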

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

Novel approach that provides token-level hallucination detection through SAE features. Empirical results show that this approach outperforms all tested contemporary baselines. The approach is generally very lightweight if you have a trained SAE for the model. The method represents a contribution towards utilizing SAE features for hallucination detection in RAG settings.

Weaknesses

The subsection "4.3 GENERALIZATION ACROSS LLMS" is poorly named; the authors only test whether each model's own trained GAM outperforms the baseline of that same model's chain-of-thought judgment of whether it hallucinated. Because the method must be trained on model-specific SAE features, it does not transfer across LLMs; that is, a separate detector must be trained for every LLM. The authors also don't really test generalization *across domains*, even for the same model

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Innovative use of sparse autoencoders to dissect RAG internal activations and link specific features to hallucination phenomena.
2. Interpretability focus: provides human-understandable explanations and visualizations for neuron activations contributing to hallucination or faithful grounding.
3. Comprehensive empirical setup, covering both quantitative metrics and qualitative feature studies on multiple benchmarks.
4. Clear motivation and relevance to improving trustworthy and explainable RA

Weaknesses

1. Lack of causal intervention: the paper provides interpretive insights but does not test whether manipulating identified SAE features can reduce hallucinations. Given that SAE features allow for activation-level control, it would be valuable to explore feature interventions to demonstrate causality.
2. Unclear explanation data source: in Section 4.4, the authors show two representative features and mention that the semantic explanations were distilled from 24 activation cases. However, it is

Reviewer 03 · Rating 4 · Confidence 4

Strengths

Unlike previous approaches, this paper takes an innovative direction by incorporating SAE-derived features for hallucination detection. This approach not only enables accurate identification of hallucinations but also offers interpretable explanations for their generation, making the proposed framework both elegant and effective. Empirical results show that RAGLens achieves superior performance on two hallucination benchmarks compared with multiple baselines. The ablation studies further valid

Weaknesses

A key concern is that the paper does not clearly establish the causal connection between SAE features and hallucination behavior. Although the theoretical analysis argues that, under sparse activation, max pooling can amplify hallucination-related signals and suppress noise, it remains unclear whether the improved detection performance primarily stems from the SAE-derived features or from the predictive capacity of the GAM. If the authors could provide stronger evidence that the learned SAE feat

Taxonomy

Topics: Topic Modeling · Misinformation and Its Impacts · Adversarial Robustness in Machine Learning