TL;DR
This paper applies sparse autoencoders (SAEs) and a related dictionary learning method, ITDA, to Flux 1, a large text-to-image diffusion model. It reports that SAE features are more interpretable than MLP neurons and can be used to steer image generation.
Contribution
It applies SAEs and Inference-Time Decomposition of Activations (ITDA) to Flux 1, a large text-to-image diffusion model, and introduces a visual automated interpretation pipeline to assess the interpretability of both.
Findings
SAEs accurately reconstruct residual stream embeddings
SAEs outperform MLP neurons in interpretability
SAEs enable steering of image generation
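To make the findings above concrete, here is a minimal sketch of what an SAE over residual stream activations and steering by activation addition can look like. It is an illustration under stated assumptions, not the paper's implementation: the sizes, the randomly initialised weights, and the names `sae_encode`, `sae_decode`, and `steer` are all hypothetical, and a real SAE would be trained on activations collected from the model.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 16, 64  # hypothetical sizes, not taken from the paper

# Randomly initialised weights stand in for a trained SAE (assumption).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_dec = np.zeros(d_model)


def sae_encode(x):
    """ReLU encoder: residual activations -> non-negative feature activations."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def sae_decode(f):
    """Linear decoder: feature activations -> reconstructed residual activations."""
    return f @ W_dec + b_dec


def steer(x, feature_idx, strength):
    """Activation addition: add a scaled decoder direction to every token."""
    return x + strength * W_dec[feature_idx]


x = rng.normal(size=(4, d_model))  # 4 tokens of made-up residual activations
f = sae_encode(x)                  # feature activations, shape (4, d_sae)
x_hat = sae_decode(f)              # reconstruction, shape (4, d_model)
x_steered = steer(x, feature_idx=3, strength=5.0)
```

In this sketch, steering leaves the SAE out of the forward pass entirely: only the decoder row for the chosen feature is added to the activations, so the edit is a fixed shift along one learned direction.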
Abstract
Sparse autoencoders are a promising new approach for decomposing language model activations for interpretation and control. They have been applied successfully to vision transformer image encoders and to small-scale diffusion models. Inference-Time Decomposition of Activations (ITDA) is a recently proposed variant of dictionary learning that takes the dictionary to be a set of data points from the activation distribution and reconstructs them with gradient pursuit. We apply Sparse Autoencoders (SAEs) and ITDA to a large text-to-image diffusion model, Flux 1, and consider the interpretability of embeddings of both by introducing a visual automated interpretation pipeline. We find that SAEs accurately reconstruct residual stream embeddings and beat MLP neurons on interpretability. We are able to use SAE features to steer image generation through activation addition. We find that ITDA has…
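The abstract describes ITDA as taking the dictionary to be a set of data points from the activation distribution. A rough sketch of that idea follows, with one deliberate substitution: the sparse code is computed with orthogonal matching pursuit, a related greedy method, rather than the gradient pursuit the abstract names. The function `omp`, the random stand-in activations, the dictionary size, and the sparsity level `k` are all assumptions made for illustration.

```python
import numpy as np


def omp(x, D, k):
    """Greedy sparse code of x over the unit-norm rows of D, using at most k atoms."""
    residual = x.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(k):
        scores = np.abs(D @ residual)      # correlation of each atom with the residual
        if support:
            scores[support] = -np.inf      # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        A = D[support].T                   # selected atoms as columns, shape (d, |support|)
        coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)  # refit all selected coefficients
        residual = x - A @ coeffs
    code = np.zeros(D.shape[0])
    code[support] = coeffs
    return code


rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 16))          # made-up activations, not model outputs

# Dictionary atoms are data points themselves (here simply the first 50 rows), normalised.
D = acts[:50] / np.linalg.norm(acts[:50], axis=1, keepdims=True)

x = acts[120]                              # an activation outside the dictionary
code = omp(x, D, k=4)                      # sparse code with at most 4 non-zero entries
x_hat = code @ D                           # reconstruction from the selected atoms
```

Because the atoms are actual activations rather than learned weights, no dictionary training step appears in this sketch; only the sparse coding at inference time is needed.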
Taxonomy
Methods: Linear Layer · Softmax · Multi-Head Attention · Attention Is All You Need · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer · Diffusion · Sparse Evolutionary Training
