Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders

Aaron J. Li; Suraj Srinivas; Usha Bhalla; Himabindu Lakkaraju

arXiv:2505.16004 · cs.LG · January 26, 2026

TL;DR

This paper investigates the robustness of concept representations in sparse autoencoders used with large language models, revealing their vulnerability to adversarial input perturbations and highlighting the need for improved stability for reliable model monitoring.

Contribution

The paper introduces a new framework for evaluating the robustness of SAE concept representations against adversarial perturbations, emphasizing the importance of stability in interpretability tools.

Findings

01. Tiny adversarial perturbations can manipulate SAE interpretations
02. SAE concept representations are fragile without denoising
03. Robustness is critical for reliable model monitoring

Abstract

Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively…
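
To make the phrase "robustness quantification as input-space optimization" concrete, here is a minimal toy sketch in plain Python. It is not the paper's method or code: the paper perturbs LLM inputs, whereas this sketch assumes a continuous input vector fed directly into a randomly initialized ReLU encoder standing in for an SAE. All names (`encode`, `pgd_untargeted`, `eps`, `step`) are hypothetical. The sketch searches, under an L-infinity budget `eps`, for a perturbation that maximizes the squared distance between the perturbed and the original concept codes, using signed-gradient ascent with projection.

```python
import random

def encode(W, b, x):
    """Toy stand-in for an SAE encoder: z = ReLU(W x + b)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def sq_dist(u, v):
    """Squared Euclidean distance between two code vectors."""
    return sum((a - c) ** 2 for a, c in zip(u, v))

def grad_sq_dist(W, b, x_adv, z_ref):
    """Gradient of ||encode(x_adv) - z_ref||^2 with respect to x_adv."""
    z = encode(W, b, x_adv)
    grad = [0.0] * len(x_adv)
    for row, zi, ri in zip(W, z, z_ref):
        if zi > 0.0:  # ReLU passes gradient only through active units
            coeff = 2.0 * (zi - ri)
            for j, w in enumerate(row):
                grad[j] += coeff * w
    return grad

def sign(v):
    return (v > 0) - (v < 0)

def pgd_untargeted(W, b, x, eps=0.05, step=0.01, iters=50, seed=0):
    """Signed-gradient ascent on the code distance, projected onto the
    L-infinity ball of radius eps. Returns the best perturbation found."""
    rng = random.Random(seed)
    z_ref = encode(W, b, x)
    # Random start: at delta = 0 the distance and its gradient are both zero.
    delta = [rng.uniform(-eps, eps) for _ in x]
    x_adv = [xi + di for xi, di in zip(x, delta)]
    best_delta, best_dist = delta, sq_dist(encode(W, b, x_adv), z_ref)
    for _ in range(iters):
        g = grad_sq_dist(W, b, x_adv, z_ref)
        # Ascent step, then clip each coordinate back into [-eps, eps].
        delta = [min(eps, max(-eps, di + step * sign(gj)))
                 for di, gj in zip(delta, g)]
        x_adv = [xi + di for xi, di in zip(x, delta)]
        d = sq_dist(encode(W, b, x_adv), z_ref)
        if d > best_dist:
            best_delta, best_dist = delta, d
    return best_delta, best_dist

# Hypothetical toy setup: 8 input dimensions, 32 latent "concepts".
rng = random.Random(0)
d_in, n_latents = 8, 32
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(n_latents)]
b = [0.0] * n_latents
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

delta, dist = pgd_untargeted(W, b, x)
linf = max(abs(d) for d in delta)
print(f"L-inf norm of perturbation: {linf:.3f}, squared code distance: {dist:.3f}")
```

In this toy setting, a larger code distance at a fixed budget would indicate a less robust representation. A targeted variant would instead minimize the distance to the code of a chosen target input; a discrete-token setting would need a different search procedure than the continuous gradient steps assumed here.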

Code & Models

Repositories

ai4life-group/sae_robustness
jax · Official

Videos

Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders · Underline

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning

Methods: Focus · Balanced Selection