Sparse Autoencoders are Capable LLM Jailbreak Mitigators
Yannick Assogba, Jacopo Cortellazzi, Javier Abad, Pau Rodriguez, Xavier Suau, Arno Blaas

TL;DR
This paper introduces CC-Delta, a sparse autoencoder (SAE)-based defense that mitigates jailbreak attacks on large language models by steering in sparse feature space. Across four models and twelve attacks, it matches or improves on the safety-utility tradeoffs of dense-space baselines and clearly outperforms dense mean-shift steering.
Contribution
The paper presents CC-Delta, an SAE-based defense that identifies jailbreak-relevant sparse features from paired harmful/jailbreak prompts and steers them at inference time, achieving safety-utility tradeoffs comparable to or better than dense-space baselines.
Findings
CC-Delta outperforms dense mean-shift steering on all four evaluated models.
The advantage over dense mean-shift steering is most pronounced on out-of-distribution attacks.
SAEs trained for interpretability can be repurposed as jailbreak defenses.
Abstract
Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. In particular, our method clearly outperforms dense mean-shift steering on all four models, and particularly against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in…
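The two steps the abstract names, feature selection by statistical testing on paired prompts and mean-shift steering in SAE latent space, can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the paired t-statistic, the ReLU encoder, the choice to subtract the mean jailbreak-minus-harmful delta, and the names `select_features`, `steer`, `W_enc`, `b_enc`, `W_dec`, and `alpha` are all hypothetical.

```python
import numpy as np

def select_features(z_harmful, z_jailbreak, top_k):
    """Rank SAE features by a paired t-statistic (assumed test) on activation deltas."""
    delta = z_jailbreak - z_harmful              # shape: (n_pairs, n_features)
    n = delta.shape[0]
    mean = delta.mean(axis=0)
    std = delta.std(axis=0, ddof=1)
    t = mean / (std / np.sqrt(n) + 1e-8)         # small constant avoids division by zero
    idx = np.argsort(-np.abs(t))[:top_k]         # features with the largest |t|
    return idx, mean[idx]

def steer(h, W_enc, b_enc, W_dec, idx, shift, alpha=1.0):
    """Shift the selected SAE features, then add only the decoded change to h."""
    z = np.maximum(h @ W_enc + b_enc, 0.0)       # ReLU SAE encoder (assumed architecture)
    z_new = z.copy()
    # Assumed direction: remove the average effect of the jailbreak context.
    z_new[..., idx] = np.maximum(z[..., idx] - alpha * shift, 0.0)
    return h + (z_new - z) @ W_dec               # leaves the SAE reconstruction error untouched

# Toy data: jailbreak context raises features 3 and 7 by about 2.0.
rng = np.random.default_rng(0)
z_harmful = rng.random((50, 16))
z_jailbreak = np.maximum(z_harmful + rng.normal(0.0, 0.1, size=(50, 16)), 0.0)
z_jailbreak[:, [3, 7]] += 2.0
idx, shift = select_features(z_harmful, z_jailbreak, top_k=2)

# Toy SAE weights and dense activations for four tokens.
W_enc = rng.normal(size=(8, 16))
b_enc = np.zeros(16)
W_dec = rng.normal(size=(16, 8))
h = rng.normal(size=(4, 8))
h_steered = steer(h, W_enc, b_enc, W_dec, idx, shift)
```

Adding only the decoded difference `(z_new - z) @ W_dec`, rather than replacing `h` with the full SAE reconstruction, is one common design choice for SAE steering; whether CC-Delta does the same is not stated in the abstract.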
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Topic Modeling
