Concept Steerers: Leveraging K-Sparse Autoencoders for Test-Time Controllable Generations
Dahye Kim, Deepti Ghadiyaram

TL;DR
This paper introduces a k-sparse autoencoder framework for test-time control of text-to-image generative models: it steers generation away from unsafe concepts, or toward a desired style, without retraining the model.
Contribution
It presents a method that uses k-sparse autoencoders for interpretable, test-time concept manipulation in diffusion models, with no retraining and without degrading generation quality.
Findings
20.01% improvement in unsafe concept removal
Effective style manipulation capabilities
Approximately 5x faster than existing methods
Abstract
Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away from or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style) -- all during test time. Through extensive experiments, we demonstrate that our approach is very simple, requires no retraining of the…
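The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weights are random rather than trained, and the names `topk_sparse_encode`, `decode`, `steer`, `concept_idx`, and `strength` are hypothetical, introduced only for illustration. The sketch assumes the common k-sparse formulation in which only the k largest hidden pre-activations are kept, and it assumes that steering amounts to shifting a text embedding along the decoder direction of one latent unit.

```python
import numpy as np

def topk_sparse_encode(x, W_enc, b_enc, k):
    """Encode x and keep only the k largest pre-activations in each row."""
    pre = x @ W_enc + b_enc                              # shape (n, d_hidden)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]       # top-k column indices per row
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, idx] = pre[rows, idx]                        # all other units stay zero
    return z

def decode(z, W_dec, b_dec):
    """Linear decoder mapping sparse codes back to embedding space."""
    return z @ W_dec + b_dec

def steer(x, W_dec, concept_idx, strength):
    """Shift embeddings along one latent's decoder direction (assumed steering rule).

    A negative strength moves away from the concept, a positive one towards it.
    """
    return x + strength * W_dec[concept_idx]

# Toy usage with random (untrained) weights; all sizes are arbitrary.
rng = np.random.default_rng(0)
d_embed, d_hidden, k = 16, 64, 8
x = rng.normal(size=(4, d_embed))                        # stand-in for text embeddings
W_enc = rng.normal(size=(d_embed, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_hidden, d_embed))
b_dec = np.zeros(d_embed)

z = topk_sparse_encode(x, W_enc, b_enc, k)
x_hat = decode(z, W_dec, b_dec)
x_steered = steer(x, W_dec, concept_idx=3, strength=-2.0)
```

In a real setting the autoencoder would first be trained to reconstruct text embeddings, and the latent index tied to a concept would have to be identified by inspecting which units activate for that concept. The steered embedding would then be passed to the unchanged diffusion model in place of the original one.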
Taxonomy
Topics: Cancer-related molecular mechanisms research · Water Systems and Optimization · Machine Learning and Data Classification
Methods: Diffusion · Balanced Selection
