SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

Bartosz Cywiński; Kamil Deja

arXiv:2501.18052 · cs.LG · May 23, 2025

TL;DR

SAeUron introduces a transparent, autoencoder-based method for removing unwanted concepts from diffusion models, improving safety and interpretability while maintaining performance and resisting adversarial content generation.

Contribution

The paper presents SAeUron, a novel autoencoder-based approach for concept unlearning in diffusion models that offers interpretability, precise intervention, and improved robustness over existing methods.

Findings

01. Outperforms existing unlearning methods on benchmark datasets

02. Effectively removes multiple concepts simultaneously

03. Reduces unwanted content generation under adversarial attacks

Abstract

Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Our evaluation…
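The pipeline the abstract describes (encode activations into sparse features, pick the features tied to a concept, then intervene on them) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the weights are hand-picked rather than trained, the scoring rule (mean activation on concept data minus mean activation on other data) is a stand-in that may differ from the paper's criterion, and the function names are hypothetical and not taken from the official repository.

```python
# Toy sketch only: hand-picked weights, hypothetical names, and an assumed
# scoring rule. Not the official SAeUron implementation.

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def encode(x, W_enc, b_enc):
    # Sparse code z = ReLU(W_enc x + b_enc).
    return relu([p + b for p, b in zip(matvec(W_enc, x), b_enc)])

def decode(z, W_dec):
    # Reconstruction x_hat = W_dec z (decoder bias omitted for brevity).
    return matvec(W_dec, z)

def select_features(concept_acts, other_acts, W_enc, b_enc, k=1):
    # Rank latents by mean activation on concept data minus other data.
    def mean_code(batch):
        codes = [encode(x, W_enc, b_enc) for x in batch]
        return [sum(col) / len(codes) for col in zip(*codes)]
    m_concept = mean_code(concept_acts)
    m_other = mean_code(other_acts)
    scores = [c - o for c, o in zip(m_concept, m_other)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def intervene(x, feature_ids, W_enc, b_enc, W_dec, scale=0.0):
    # Rescale the selected latents, then apply only the resulting change
    # to x so that whatever the autoencoder fails to reconstruct is kept.
    z = encode(x, W_enc, b_enc)
    z_new = [a * scale if i in feature_ids else a for i, a in enumerate(z)]
    delta = [n - o for n, o in zip(decode(z_new, W_dec), decode(z, W_dec))]
    return [a + d for a, d in zip(x, delta)]

# Toy setup: 2-dimensional activations, 2 latents, identity weights.
W_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0]]

concept_acts = [[2.0, 0.1], [1.5, 0.2]]   # activations with the concept
other_acts = [[0.0, 0.3], [0.1, 0.1]]     # activations without it

selected = select_features(concept_acts, other_acts, W_enc, b_enc, k=1)
edited = intervene([2.0, 0.5], selected, W_enc, b_enc, W_dec, scale=0.0)
```

In this toy setup, latent 0 is the one that fires on the concept examples, so it is selected and zeroed while the second coordinate of the activation passes through unchanged. In a real diffusion model the intervention would run on activations of a chosen layer at each denoising timestep, and `scale` could be set to a negative value instead of zero; both are design choices this sketch does not settle.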

Code & Models

Repositories

cywinski/saeuron · PyTorch · Official

Models

Videos

SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders · SlidesLive

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques

Methods: Feature Selection · Balanced Selection · Diffusion