TL;DR
This paper investigates how integrating pretrained sparse autoencoders (SAEs) into transformer models at inference time can significantly improve robustness against jailbreak attacks, revealing a tradeoff between sparsity, layer depth, and model utility.
Contribution
It introduces a method of augmenting transformer models with pretrained sparse autoencoders at inference time to enhance robustness without retraining.
Findings
SAE augmentation reduces jailbreak success rate by up to 5x relative to the undefended baseline
Attack success rate follows a monotonic dose-response relationship with L0 sparsity
Intermediate layers offer the best robustness-utility balance
Abstract
Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers…
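The integration described in the abstract can be illustrated with a minimal sketch. It assumes a top-k SAE, which is an assumption: the abstract does not say which SAE architecture or activation function the pretrained SAEs use. Plain Python lists stand in for model tensors, and the names `sae_reconstruct`, `W_enc`, `W_dec`, `b_enc`, `b_dec`, and the toy dimensions are all hypothetical. In a real setup the reconstruction would replace the residual-stream activation at a chosen layer during the forward pass, leaving the model weights untouched.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sae_reconstruct(x, W_enc, b_enc, W_dec, b_dec, k):
    """Return (x_hat, z): a top-k sparse reconstruction of residual vector x.

    Assumption: top-k sparsity, so the L0 norm of z is at most k.
    W_enc has shape (n_latents, d_model); W_dec has shape (d_model, n_latents).
    """
    # Encode: subtract the decoder bias, project, add the encoder bias, apply ReLU.
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    acts = [max(0.0, p) for p in pre]

    # Sparsify: keep only the k largest activations, zero the rest.
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    keep = set(order[:k])
    z = [a if i in keep else 0.0 for i, a in enumerate(acts)]

    # Decode: map the sparse code back to the residual-stream dimension.
    x_hat = [s + b for s, b in zip(matvec(W_dec, z), b_dec)]
    return x_hat, z

# Toy example with random (untrained) weights, for shape checking only.
random.seed(0)
d_model, n_latents, k = 4, 8, 2
W_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_latents)]
W_dec = [[random.gauss(0, 1) for _ in range(n_latents)] for _ in range(d_model)]
b_enc = [0.0] * n_latents
b_dec = [0.0] * d_model

x = [random.gauss(0, 1) for _ in range(d_model)]
x_hat, z = sae_reconstruct(x, W_enc, b_enc, W_dec, b_dec, k)
l0 = sum(1 for v in z if v != 0.0)  # number of active latents, at most k
print(len(x_hat), l0)
```

Lowering `k` makes the code sparser (lower L0). The sketch only shows the mechanics of swapping a residual vector for its SAE reconstruction; it does not reproduce the reported relationship between L0 and attack success rate.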
