SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

Jiaojiao Han; Wujiang Xu; Mingyu Jin; Mengnan Du

arXiv:2511.20820 · cs.CL · February 11, 2026



TL;DR

SAGE introduces an active, explanation-driven framework for interpreting SAE features in language models, significantly improving the accuracy and reliability of feature explanations compared to existing methods.

Contribution

The paper presents SAGE, a novel agent-based framework that systematically formulates, tests, and refines explanations for SAE features in LLMs, advancing interpretability techniques.

Findings

1. SAGE achieves higher explanation accuracy than state-of-the-art baselines.
2. SAGE provides more reliable and interpretable features across diverse language models.
3. The framework enhances understanding of LLM internal mechanisms.

Abstract

Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate…


Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling