SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
Bartosz Cywiński, Kamil Deja

TL;DR
SAeUron is a transparent, sparse-autoencoder-based method for removing unwanted concepts from text-to-image diffusion models. It improves safety and interpretability while preserving overall performance and reducing unwanted generations under adversarial attacks.
Contribution
The paper presents SAeUron, a sparse-autoencoder-based approach to concept unlearning in diffusion models that offers interpretability, precise interventions on model activations, and greater robustness than existing methods.
Findings
Outperforms existing unlearning methods on benchmark datasets
Effectively removes multiple concepts simultaneously
Reduces unwanted content generation under adversarial attacks
Abstract
Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Our evaluation…
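As a rough illustration of the pipeline the abstract describes (encode activations with a sparse autoencoder, select features tied to a concept, then edit those features in the activations), here is a minimal NumPy sketch. It is not the authors' implementation: the ReLU encoder, the mean-difference scoring rule, the `scale` parameter, and every function name below are assumptions made for illustration only.

```python
import numpy as np


def sae_encode(x, W_enc, b_enc):
    """Map activations (n, d) to non-negative SAE features (n, m) with a ReLU encoder (assumed form)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def sae_decode(z, W_dec, b_dec):
    """Map SAE features (n, m) back to activation space (n, d)."""
    return z @ W_dec + b_dec


def select_concept_features(z_concept, z_other, k):
    """Hypothetical scoring rule: rank features by how much more strongly they
    fire on concept data than on other data, and return the top-k indices."""
    score = z_concept.mean(axis=0) - z_other.mean(axis=0)
    return np.argsort(score)[::-1][:k]


def ablate(x, feature_idx, W_enc, b_enc, W_dec, b_dec, scale=0.0):
    """Rescale the selected features and write the change back into x.

    Only the difference between the edited and unedited reconstructions is
    added, so whatever the SAE fails to reconstruct stays untouched
    (a design choice assumed for this sketch).
    """
    z = sae_encode(x, W_enc, b_enc)
    z_edit = z.copy()
    z_edit[:, feature_idx] *= scale
    return x + sae_decode(z_edit, W_dec, b_dec) - sae_decode(z, W_dec, b_dec)
```

With `scale=1.0` the activations pass through unchanged, while `scale=0.0` removes the selected features' contribution entirely. In a real diffusion model `x` would be activations captured from a chosen layer at each denoising timestep; which layer and how the features are scored are details this excerpt does not specify.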
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques
Methods: Feature Selection · Balanced Selection · Diffusion
