Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning

Stepan Shabalin; Ayush Panda; Dmitrii Kharlapenko; Abdur Raheem Ali; Yixiong Hao; Arthur Conmy

arXiv:2505.24360·cs.LG·July 14, 2025

TL;DR

This paper applies sparse autoencoders (SAEs) and Inference-Time Decomposition of Activations (ITDA) to Flux 1, a large text-to-image diffusion model. It reports that SAE features are more interpretable than MLP neurons and can steer image generation through activation addition.

Contribution

The paper applies SAEs and ITDA to a large text-to-image diffusion model, Flux 1, and introduces a visual automated interpretation pipeline for assessing the interpretability of the resulting embeddings.

Findings

01 SAEs accurately reconstruct residual stream embeddings

02 SAEs outperform MLP neurons in interpretability

03 SAEs enable steering of image generation
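
The first and third findings can be illustrated with a toy sketch. It is not the paper's code: the ReLU encoder, the 3-dimensional activation, the two hand-picked features, and the names `encode`, `decode`, and `steer` are all assumptions made for illustration. It shows an SAE reconstructing an activation from sparse features, and steering by adding a scaled decoder direction to the activation.

```python
# Toy sketch, NOT the paper's implementation. Weights are hypothetical:
# 2 features over a 3-dimensional "residual stream" activation.
W_ENC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # one encoder row per feature
B_ENC = [0.0, -0.5]                          # encoder bias per feature
W_DEC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # one decoder direction per feature
B_DEC = [0.0, 0.0, 0.0]                      # decoder bias


def encode(x):
    """Feature activations: ReLU(W_enc (x - b_dec) + b_enc). ReLU is an assumption."""
    centered = [xi - bi for xi, bi in zip(x, B_DEC)]
    pre = [sum(w * c for w, c in zip(row, centered)) + b
           for row, b in zip(W_ENC, B_ENC)]
    return [max(0.0, p) for p in pre]


def decode(features):
    """Reconstruction: b_dec plus the activation-weighted sum of decoder directions."""
    out = list(B_DEC)
    for act, direction in zip(features, W_DEC):
        for i, d in enumerate(direction):
            out[i] += act * d
    return out


def steer(x, feature_idx, strength):
    """Activation addition: push x along one feature's decoder direction."""
    return [xi + strength * di for xi, di in zip(x, W_DEC[feature_idx])]


x = [2.0, 1.5, 0.0]
features = encode(x)
reconstruction = decode(features)
steered = steer(x, feature_idx=1, strength=3.0)
```

In a real model, the steered activation would replace the original at the hooked layer during the forward pass; where and how the paper hooks Flux 1 is not stated in this page.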

Abstract

Sparse autoencoders are a promising new approach for decomposing language model activations for interpretation and control. They have been applied successfully to vision transformer image encoders and to small-scale diffusion models. Inference-Time Decomposition of Activations (ITDA) is a recently proposed variant of dictionary learning that takes the dictionary to be a set of data points from the activation distribution and reconstructs them with gradient pursuit. We apply Sparse Autoencoders (SAEs) and ITDA to a large text-to-image diffusion model, Flux 1, and consider the interpretability of embeddings of both by introducing a visual automated interpretation pipeline. We find that SAEs accurately reconstruct residual stream embeddings and beat MLP neurons on interpretability. We are able to use SAE features to steer image generation through activation addition. We find that ITDA has…
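
As a rough illustration of the ITDA idea described in the abstract (a dictionary made of data points, with activations reconstructed by a sparse pursuit algorithm), the sketch below uses plain matching pursuit. This is a simpler stand-in for the gradient pursuit the abstract names, not the authors' algorithm, and the atoms, the input, and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def matching_pursuit(x, dictionary, k):
    """Greedy sparse code of x over unit-norm atoms, using at most k steps.

    Plain matching pursuit, swapped in here for gradient pursuit.
    Returns ({atom_index: coefficient}, residual).
    """
    residual = list(x)
    coeffs = {}
    for _ in range(k):
        scores = [dot(residual, atom) for atom in dictionary]
        j = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        if abs(scores[j]) < 1e-12:
            break  # nothing left that any atom can explain
        coeffs[j] = coeffs.get(j, 0.0) + scores[j]
        residual = [r - scores[j] * a for r, a in zip(residual, dictionary[j])]
    return coeffs, residual


# Hypothetical "data point" atoms, normalized before use.
dictionary = [normalize([2.0, 0.0, 0.0]), normalize([0.0, 3.0, 0.0])]
coeffs, residual = matching_pursuit([3.0, 1.0, 0.5], dictionary, k=2)
```

The residual is the part of the activation that the chosen atoms do not explain; here the third coordinate stays in the residual because no atom points along it.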

Code & Models

Repositories

kisate/flux-saes-gpu (PyTorch, official)

Taxonomy

Methods: Linear Layer · Softmax · Multi-Head Attention · Attention Is All You Need · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer · Diffusion · Sparse Evolutionary Training