Towards Understanding the Robustness of Sparse Autoencoders

Ahson Saiyed; Sabrina Sadiekh; Chirag Agarwal

arXiv:2604.18756·cs.LG·April 22, 2026

TL;DR

This paper investigates how integrating pretrained sparse autoencoders into transformer residual streams at inference time improves robustness against jailbreak attacks (up to a 5x lower success rate), and characterizes how attack success depends on sparsity and how the defense-utility tradeoff depends on layer depth.

Contribution

It augments transformer models with pretrained sparse autoencoders at inference time, improving robustness without retraining or modifying model weights.

Findings

01. SAE augmentation reduces jailbreak success rate by up to 5x relative to the undefended baseline

02. Attack success rate varies monotonically with L0 sparsity

03. Intermediate layers offer the best robustness-utility balance
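
As a reading aid for the "up to 5x" figure: it is a ratio of attack success rates. The definition below is the usual one and is an assumption here, since this page does not give the paper's exact scoring procedure. N is the number of attack prompts evaluated.

```latex
\mathrm{ASR} = \frac{\#\{\text{attack prompts judged successful}\}}{N},
\qquad
\text{reduction factor} = \frac{\mathrm{ASR}_{\text{baseline}}}{\mathrm{ASR}_{\text{SAE}}} \le 5
```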

Abstract

Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers…
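
To make the setup in the abstract concrete, here is a minimal sketch of splicing an SAE into a residual stream: the hidden vector at one layer is encoded into sparse latents and replaced by its reconstruction. Everything specific in it is an assumption for illustration, not the authors' implementation: the top-k activation used to control L0, the function names, and the toy weights are all hypothetical, and the pretrained SAEs in the paper may use a different architecture. The sketch is pure Python and does not model gradients; in a real framework the same operations would remain differentiable, consistent with the paper's "no gradient blocking" setting.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: one inner list per output dimension


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(h: Vector, w_enc: Matrix, b_enc: Vector, k: int) -> Vector:
    """ReLU encoder followed by top-k selection (assumed variant).

    At most k latents stay non-zero, so k is an upper bound on L0.
    """
    pre = [max(0.0, a + b) for a, b in zip(matvec(w_enc, h), b_enc)]
    # Indices of the k largest activations; all other latents are zeroed.
    order = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(pre)]


def sae_decode(z: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Linear decoder mapping sparse latents back to the residual stream."""
    return [a + b for a, b in zip(matvec(w_dec, z), b_dec)]


def splice_sae(h: Vector, w_enc: Matrix, b_enc: Vector,
               w_dec: Matrix, b_dec: Vector, k: int) -> Vector:
    """Replace a residual-stream vector with its SAE reconstruction.

    The base model's weights are untouched; only the activation passed
    to the next layer changes.
    """
    return sae_decode(sae_encode(h, w_enc, b_enc, k), w_dec, b_dec)


# Hypothetical toy weights: model width 2, SAE width 3.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
B_DEC = [0.0, 0.0]

h = [2.0, 1.0]                       # stand-in residual-stream vector
h_sparse = splice_sae(h, W_ENC, B_ENC, W_DEC, B_DEC, k=1)
h_dense = splice_sae(h, W_ENC, B_ENC, W_DEC, B_DEC, k=3)
```

Varying `k` here plays the role of the sparsity knob in the paper's ablations, and choosing which layer's activation to pass through `splice_sae` corresponds to the layer-depth ablation.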

Code & Models

Repositories

aikyamlab/sparse-jailbreak (GitHub)
