Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Yannick Assogba; Jacopo Cortellazzi; Javier Abad; Pau Rodriguez; Xavier Suau; Arno Blaas

arXiv:2602.12418 · cs.CR · February 16, 2026

Open Access

TL;DR

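This paper introduces Context-Conditioned Delta Steering (CC-Delta), a sparse autoencoder (SAE)-based defense that mitigates jailbreak attacks on large language models by steering in sparse feature space. Across four instruction-tuned models and twelve jailbreak attacks, it achieves comparable or better safety-utility tradeoffs than dense-space baselines and clearly outperforms dense mean-shift steering.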

Contribution

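The paper presents CC-Delta, an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context, selects them via statistical testing, and mitigates jailbreaks with inference-time mean-shift steering in SAE latent space.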

Findings

01

CC-Delta clearly outperforms dense mean-shift steering on all four evaluated models.

02

The advantage over dense mean-shift steering is most pronounced against out-of-distribution attacks.

03

SAEs trained for interpretability can be repurposed as jailbreak defenses.

Abstract

Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. In particular, our method clearly outperforms dense mean-shift steering on all four models, and particularly against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in…
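
The selection and steering steps described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the abstract does not specify the statistical test, the selection threshold, or the steering strength, so this sketch uses a paired t-test, a hypothetical `t_threshold`, and a hypothetical coefficient `alpha`. It also works directly on toy SAE feature vectors and omits the SAE encoder and decoder.

```python
import math
import random
from statistics import mean, stdev


def select_features(z_plain, z_jailbreak, t_threshold=5.0):
    """Paired t-test per feature on (jailbreak - plain) activation deltas.

    Row i of z_plain and z_jailbreak holds SAE feature activations for the same
    harmful request without and with jailbreak context (toy data here).
    Returns {feature_index: mean_delta} for features with |t| > t_threshold.
    The test and the threshold are illustrative assumptions.
    """
    n = len(z_plain)
    selected = {}
    for j in range(len(z_plain[0])):
        deltas = [z_jailbreak[i][j] - z_plain[i][j] for i in range(n)]
        mu, sd = mean(deltas), stdev(deltas)
        if sd == 0.0:
            continue  # t-statistic undefined for a constant delta; skip
        t_stat = mu / (sd / math.sqrt(n))
        if abs(t_stat) > t_threshold:
            selected[j] = mu
    return selected


def steer(z, selected, alpha=1.0):
    """Mean-shift steering: move selected features against their mean delta."""
    steered = list(z)
    for j, mu in selected.items():
        # Clamp at zero, assuming a ReLU-style SAE with non-negative features.
        steered[j] = max(0.0, z[j] - alpha * mu)
    return steered


# Synthetic paired data: 8 features, of which feature 2 rises and feature 5
# falls when jailbreak context is added. All values are made up.
random.seed(0)
shift = [0.0, 0.0, 1.0, 0.0, 0.0, -0.8, 0.0, 0.0]
z_plain = [[1.0 + random.gauss(0, 0.05) for _ in shift] for _ in range(200)]
z_jb = [[p + s + random.gauss(0, 0.05) for p, s in zip(row, shift)] for row in z_plain]

selected = select_features(z_plain, z_jb)
steered = steer(z_jb[0], selected)
print(sorted(selected))
```

In a full pipeline, the steered feature vector would be mapped back to the model's residual stream through the SAE decoder; that step is left out here because the abstract does not describe how it is done.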

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Topic Modeling