Exploring Transformer Backbones for Image Diffusion Models
Princy Chahal

TL;DR
This paper introduces a Transformer-based Latent Diffusion model for image synthesis, demonstrating comparable performance to UNet architectures and enabling easier integration of text and image data through multi-head attention.
Contribution
It presents the first end-to-end Transformer architecture for Latent Diffusion models, simplifying design and enhancing multimodal data fusion capabilities.
Findings
Achieves 14.1 FID on class-conditioned ImageNet generation, comparable to the 13.1 FID of a UNet-based architecture.
Enables direct interaction between text and image features without cross-attention.
Simplifies architecture while maintaining competitive image synthesis quality.
Abstract
We present an end-to-end Transformer-based Latent Diffusion model for image synthesis. On the ImageNet class-conditioned generation task, we show that a Transformer-based Latent Diffusion model achieves a 14.1 FID, which is comparable to the 13.1 FID score of a UNet-based architecture. In addition to showing the application of Transformer models to Diffusion-based image synthesis, this simplification in architecture allows easy fusion and modeling of text and image data. The multi-head attention mechanism of Transformers enables simplified interaction between the image and text features, which removes the need for the cross-attention mechanism used in UNet-based Diffusion models.
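The abstract's point about fusing text and image features can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a single attention head, identity query/key/value projections, and hand-picked 4-dimensional token vectors, and the names `self_attention`, `text_tokens`, and `image_tokens` are hypothetical. It only shows that once text and image tokens are concatenated into one sequence, ordinary self-attention already lets every image token attend to every text token, with no separate cross-attention layer.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Simplifying assumption: Q, K and V are the tokens themselves
    (identity projections), so there are no learned weights here.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical toy embeddings: 2 text tokens and 3 image-latent tokens.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_tokens = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

# Fusion step: one joint sequence, one self-attention pass.
sequence = text_tokens + image_tokens
outputs, weights = self_attention(sequence)

# Attention mass that each image token places on the text tokens.
n_text = len(text_tokens)
image_to_text = [sum(row[:n_text]) for row in weights[n_text:]]
print(image_to_text)
```

Every entry of `image_to_text` is strictly positive, which is the mixing that a UNet-based model would obtain through a dedicated cross-attention block. A real model would add learned projections, multiple heads, position encodings, and the diffusion timestep conditioning, none of which are modeled here.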
Taxonomy
Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Latent Diffusion Model · Absolute Position Encodings · Linear Layer · Adam · Layer Normalization · Softmax · Byte Pair Encoding · Residual Connection
