Understanding Refusal in Language Models with Sparse Autoencoders

Wei Jie Yeo; Nirmalendu Prakash; Clement Neo; Roy Ka-Wei Lee; Erik Cambria; Ranjan Satapathy

arXiv:2505.23556·cs.CL·May 30, 2025

TL;DR

This paper uses sparse autoencoders to study the internal mechanisms of refusal in instruction-tuned language models, identifying latent features that causally mediate refusal behavior.

Contribution

The paper introduces a mechanistic approach that uses sparse autoencoders to identify, and intervene on, the latent features that mediate refusals in language models.

Findings

01 Identified latent features that causally mediate refusal behavior

02 Validated the effect of intervening on these features across multiple harmful datasets

03 Showed that refusal features improve the generalization of linear probes to out-of-distribution adversarial inputs

Abstract

Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationships and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing generalization for linear probes to out-of-distribution adversarial…
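
To make the idea of intervening on a sparse autoencoder latent concrete, here is a minimal sketch. It assumes a generic ReLU autoencoder with toy weights, and every name in it (`sae_encode`, `sae_decode`, `intervene`, `feature_ids`) is hypothetical. It is not taken from the paper or from the wj210/refusal_sae repository, whose architecture and intervention procedure may differ.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # Assumed ReLU encoder: latent activations are non-negative.
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    # Linear decoder mapping latents back to the model's activation space.
    return z @ W_dec + b_dec

def intervene(x, W_enc, b_enc, W_dec, b_dec, feature_ids, value=0.0):
    """Clamp selected latents to `value` and return the edited activation.

    The reconstruction error is added back so that only the chosen
    latents change the activation (an assumption of this sketch).
    """
    idx = np.asarray(feature_ids, dtype=int)
    z = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(z, W_dec, b_dec)  # part the autoencoder does not explain
    z_edit = z.copy()
    z_edit[..., idx] = value                 # e.g. 0.0 ablates the latents
    return sae_decode(z_edit, W_dec, b_dec) + error

# Toy usage with random weights (illustrative only, no trained model).
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)
x = rng.normal(size=(4, d_model))            # four token activations
x_ablated = intervene(x, W_enc, b_enc, W_dec, b_dec, feature_ids=[3, 5])
```

In this sketch, ablating a latent subtracts exactly that latent's decoder contribution from the activation, so an empty `feature_ids` leaves the activation unchanged.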

Code & Models

Repositories

wj210/refusal_sae
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)