Model Unlearning via Sparse Autoencoder Subspace Guided Projections

Xu Wang; Zihao Li; Benyou Wang; Yan Hu; Difan Zou

arXiv:2505.24428 · cs.CL · June 2, 2025

TL;DR

This paper introduces SSPU, a novel framework using sparse autoencoder features to enable precise, interpretable, and robust unlearning in large language models, effectively removing harmful knowledge while preserving utility.

Contribution

The paper proposes SSPU, a new subspace-guided unlearning method leveraging SAE features for targeted, interpretable, and robust knowledge removal in large language models.

Findings

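01. Reduces accuracy on harmful-knowledge questions by 3.22% compared to baselines.

02. Improves adversarial robustness against jailbreak prompts.

03. Exposes limitations of prior unlearning methods, which either lack interpretability or fail to defend robustly against adversarial prompts.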

Abstract

Large language models (LLMs) store vast amounts of information, making them powerful yet raising privacy and safety concerns when selective knowledge removal is required. Existing unlearning strategies, ranging from gradient-based fine-tuning and model editing to sparse autoencoder (SAE) steering, either lack interpretability or fail to provide a robust defense against adversarial prompts. We propose SAE-Guided Subspace Projection Unlearning (SSPU), a novel framework that leverages SAE features to drive targeted updates in the model's parameter space, enabling precise, interpretable, and robust unlearning. SSPU's three-stage pipeline performs data-driven layer and feature selection, subspace construction via QR decomposition, and constrained optimization that controls activations into an "irrelevant" subspace while preserving retained knowledge. Overall, we use SAE features to construct…
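
The "subspace construction via QR decomposition" step can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the toy vectors stand in for SAE feature directions, and the names `orthonormal_basis` and `project` are hypothetical. The sketch computes the Q factor of a thin QR decomposition by modified Gram-Schmidt, then projects an activation vector onto the resulting subspace. How SSPU uses such a subspace inside its constrained optimization is not specified in the text above.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-10):
    """Orthonormal basis for span(vectors).

    This is the Q factor of a thin QR decomposition, computed with
    modified Gram-Schmidt. Linearly dependent inputs are skipped.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(q, w)  # component of w along q
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # keep only genuinely new directions
            basis.append([wi / norm for wi in w])
    return basis


def project(x, basis):
    """Orthogonal projection of x onto the span of an orthonormal basis."""
    out = [0.0] * len(x)
    for q in basis:
        coeff = dot(q, x)
        out = [oi + coeff * qi for oi, qi in zip(out, q)]
    return out


# Assumption: toy stand-ins for the directions of selected SAE features.
# The third vector is the sum of the first two, so it adds no new direction.
feature_directions = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [2.0, 1.0, 1.0, 0.0],
]
basis = orthonormal_basis(feature_directions)

activation = [3.0, 1.0, 2.0, 5.0]           # hypothetical activation vector
in_subspace = project(activation, basis)    # part lying inside the subspace
residual = [a - p for a, p in zip(activation, in_subspace)]  # part outside it

print("basis size:", len(basis))
print("projection:", in_subspace)
print("residual:", residual)
```

With real model dimensions, a library routine such as `numpy.linalg.qr` would normally replace the hand-written loop; the plain-Python version is used here only to keep the example self-contained.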

Taxonomy

Topics: Neural Networks and Applications · Image Processing and 3D Reconstruction · Advanced Neural Network Applications

Methods: Sparse Autoencoder · Sparse Evolutionary Training