Model Unlearning via Sparse Autoencoder Subspace Guided Projections
Xu Wang, Zihao Li, Benyou Wang, Yan Hu, Difan Zou

TL;DR
This paper introduces SSPU, a novel framework using sparse autoencoder features to enable precise, interpretable, and robust unlearning in large language models, effectively removing harmful knowledge while preserving utility.
Contribution
The paper proposes SSPU, a new subspace-guided unlearning method leveraging SAE features for targeted, interpretable, and robust knowledge removal in large language models.
Findings
Reduces harmful knowledge accuracy by 3.22% compared to baselines.
Improves adversarial robustness against jailbreak prompts.
Exposes limitations of prior unlearning methods.
Abstract
Large language models (LLMs) store vast amounts of information, making them powerful yet raising privacy and safety concerns when selective knowledge removal is required. Existing unlearning strategies, ranging from gradient-based fine-tuning and model editing to sparse autoencoder (SAE) steering, either lack interpretability or fail to provide a robust defense against adversarial prompts. We propose SAE-Guided Subspace Projection Unlearning (SSPU), a novel framework that leverages SAE features to drive targeted updates in the model's parameter space, enabling precise, interpretable, and robust unlearning. SSPU's three-stage pipeline performs data-driven layer and feature selection, subspace construction via QR decomposition, and constrained optimization that controls activations into an "irrelevant" subspace while preserving retained knowledge. Overall, we use SAE features to construct…
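The subspace-construction stage named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes that the selected SAE features are represented as direction vectors in the model's hidden space, and the names `orthonormal_basis`, `project_onto`, `project_out`, and `decoder_vectors` are hypothetical, with random data standing in for real SAE directions. It shows only how QR decomposition turns a set of directions into an orthonormal basis, and how a vector splits into a part inside that subspace and a part orthogonal to it.

```python
import numpy as np


def orthonormal_basis(feature_directions):
    """Return an orthonormal basis Q for the span of the given columns.

    feature_directions: array of shape (d_model, k), one column per
    selected feature direction (hypothetical stand-in for SAE vectors).
    """
    # Reduced QR: q has shape (d_model, k) with orthonormal columns.
    q, _ = np.linalg.qr(feature_directions)
    return q


def project_onto(q, x):
    """Component of x that lies inside the subspace spanned by q."""
    return q @ (q.T @ x)


def project_out(q, x):
    """Component of x that is orthogonal to the subspace spanned by q."""
    return x - project_onto(q, x)


# Toy data only: random directions, not real SAE features.
rng = np.random.default_rng(0)
d_model, k = 16, 3
decoder_vectors = rng.standard_normal((d_model, k))

q = orthonormal_basis(decoder_vectors)
update = rng.standard_normal(d_model)

inside = project_onto(q, update)    # part aligned with the chosen features
outside = project_out(q, update)    # part orthogonal to the chosen features
```

The two components always sum back to the original vector, so a constrained update can keep one part and discard the other. How SSPU actually uses the two subspaces during optimization is described in the paper, not here.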
Taxonomy
Topics: Neural Networks and Applications · Image Processing and 3D Reconstruction · Advanced Neural Network Applications
Methods: Sparse Autoencoder · Sparse Evolutionary Training
