Steering Language Model Refusal with Sparse Autoencoders

Kyle O'Brien; David Majercak; Xavier Fernandes; Richard Edgar; Blake Bullwinkel; Jingya Chen; Harsha Nori; Dean Carignan; Eric Horvitz; Forough Poursabzi-Sangdeh

arXiv:2411.11296·cs.LG·July 14, 2025

Open Access

TL;DR

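This paper investigates steering language model activations with sparse autoencoder (SAE) features at inference time to improve refusal of unsafe prompts. Steering improves robustness against single-turn and multi-turn jailbreaks but degrades performance across multiple benchmark tasks, even on safe inputs, which points to a safety-capability tradeoff.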

Contribution

It introduces a novel inference-time steering method using sparse autoencoders and analyzes its impact on safety and model capabilities.

Findings

01. SAE steering enhances safety against jailbreaks
02. Performance drops across multiple benchmark tasks
03. Safety features are entangled with core model capabilities

Abstract

Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we explore an alternative: steering model activations at inference time via amplifying sparse autoencoder (SAE) features that mediate refusal. This work uncovers a fundamental tension between SAE steering-based safety improvements and general model capabilities. While feature steering successfully improves robustness against both single-turn and challenging multi-turn jailbreak attempts, we discover that this comes at a previously underexplored cost -- systematic degradation of performance across multiple benchmark tasks, even on safe inputs with no apparent connection to refusal behavior. This suggests that features mediating refusal may be more deeply entangled with general language model…
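
The idea of amplifying an SAE feature at inference time can be illustrated with a minimal sketch. It is not the authors' implementation: the toy SAE weights are random, and `ToySAE`, `steer_feature`, `feature_idx`, and `clamp_value` are hypothetical names introduced here. It also assumes two details the abstract does not state: that amplification is done by clamping the feature to a fixed value, and that the SAE's reconstruction error is added back so only the chosen feature changes.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class ToySAE:
    """Tiny ReLU sparse autoencoder with random weights (illustration only)."""

    def __init__(self, d_model, d_sae, seed=0):
        rng = random.Random(seed)
        # Encoder maps d_model -> d_sae, decoder maps d_sae -> d_model.
        self.W_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_sae)]
        self.b_enc = [0.0] * d_sae
        self.W_dec = [[rng.gauss(0, 0.5) for _ in range(d_sae)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        pre = matvec(self.W_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        return [r + b for r, b in zip(matvec(self.W_dec, f), self.b_dec)]


def steer_feature(sae, x, feature_idx, clamp_value):
    """Clamp one SAE feature and return the steered activation.

    Assumption: the reconstruction error (x - decode(encode(x))) is added
    back, so the activation changes only along the chosen feature's
    decoder direction.
    """
    f = sae.encode(x)
    error = [xi - ri for xi, ri in zip(x, sae.decode(f))]
    f_steered = list(f)
    f_steered[feature_idx] = clamp_value  # amplify the chosen feature
    return [s + e for s, e in zip(sae.decode(f_steered), error)]


# Hypothetical usage: x stands in for one activation vector at one token.
sae = ToySAE(d_model=4, d_sae=8, seed=0)
x = [0.5, -1.0, 0.25, 2.0]
steered = steer_feature(sae, x, feature_idx=3, clamp_value=4.0)
```

Under these assumptions the steered activation equals the original plus a scaled copy of the chosen feature's decoder column. In a real model this function would run inside a hook on the chosen layer, applied at each token position.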

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling