Efficient Scaling of Diffusion Transformers for Text-to-Image Generation

Hao Li; Shamit Lal; Zhiheng Li; Yusheng Xie; Ying Wang; Yang Zou; Orchid Majumder; R. Manmatha; Zhuowen Tu; Stefano Ermon; Stefano Soatto; Ashwin Swaminathan

arXiv:2412.12391·cs.CV·December 18, 2024

Open Access · 3 Reviews

TL;DR

This paper investigates the scaling behavior of Diffusion Transformers (DiTs) for text-to-image generation. It finds that U-ViT, a pure self-attention DiT, scales more effectively than cross-attention variants and that a 2.3B U-ViT model outperforms SDXL UNet and other DiT variants in a controlled setting.

Contribution

It identifies U-ViT, a pure self-attention based DiT, as simpler and more scalable than cross-attention based DiT variants, and provides empirical insights into how dataset size and caption quality affect training.

Findings

01

A 2.3B-parameter U-ViT outperforms SDXL UNet and other DiT variants in a controlled setting.

02

Scaling dataset size and caption quality improves performance and learning efficiency.

03

Pure self-attention DiT models are more effective and scalable than cross-attention variants.

Abstract

We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters on datasets of up to 600M images. We find that U-ViT, a pure self-attention based DiT model, provides a simpler design and scales more effectively than cross-attention based DiT variants, which allows straightforward expansion to extra conditions and other modalities. We find that a 2.3B U-ViT model can outperform SDXL UNet and other DiT variants in a controlled setting. On the data scaling side, we investigate how increasing dataset size and using enhanced long captions improve text-image alignment performance and learning efficiency.
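The difference between the two conditioning styles named in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes single-head attention with identity projections (no learned weights) and uses hypothetical helper names (`attention`, `joint_self_attention`, `cross_attention`) to show only how tokens are routed: a pure self-attention design places text and image tokens in one sequence, while a cross-attention design lets image tokens query a separate text sequence.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors.

    Assumption: identity projections, so no learned Q/K/V weights.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def joint_self_attention(image_tokens, text_tokens):
    """Self-attention-only conditioning: text and image tokens share one sequence."""
    seq = text_tokens + image_tokens
    return attention(seq, seq, seq)


def cross_attention(image_tokens, text_tokens):
    """Cross-attention conditioning: image tokens query the text tokens."""
    return attention(image_tokens, text_tokens, text_tokens)


random.seed(0)
dim = 4  # toy embedding width (illustrative only)
image_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
text_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

joint_out = joint_self_attention(image_tokens, text_tokens)
cross_out = cross_attention(image_tokens, text_tokens)
print(len(joint_out), len(cross_out))  # 9 tokens (text + image) vs 6 (image only)
```

In the joint formulation, adding another condition amounts to appending more tokens to `seq`, which is one way to read the abstract's remark about straightforward expansion to extra conditions and modalities.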

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Extensive experiments are conducted on models with different architectures, different numbers of parameters, and different dataset settings. Given that transformer-based diffusion models are important and popular in both image and video generation, the practical findings from this paper could contribute to the community and help other researchers in future model architecture design and model training.

Weaknesses

Although the paper provides some valuable experience and findings, it lacks in-depth analysis. I appreciate the authors' efforts in conducting comprehensive experiments and offering practical findings, but performing only ablation studies based on existing model designs might be a weakness for a research paper. In the experiments, some models are not trained for the same number of steps. As a result, we can compare their performance at the early stages of training, but whether the comparison w

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- This paper conducted extensive experiments to examine the model scaling properties of different kinds of models, including UNet based ones like SD2 and SDXL, and transformer based ones like UViT, LargeDiT, and PixArt-$\alpha$.
- This paper investigated many scaling perspectives, including model size scaling, data size scaling, caption scaling, token number scaling, and so on.
- This paper delivered a message that UViT models have better scaling properties, which can provide a reference for fut

Weaknesses

- Despite the extensive study, there seem to be no novel technical contributions in this paper.
- The analysis of why long caption enhancement and dataset scaling help to improve text-image alignment performance does not seem thorough enough.

Reviewer 03 · Rating 3 · Confidence 2

Strengths

The research topic is popular. The paper is well-written.

Weaknesses

There is not much to take away from this paper. The paper seems to be incomplete:
1) Important metrics like FID and IS are not used in the paper.
2) Although the authors study the scaling of DiTs, they didn't provide a stronger version of DiT. Hence, they didn't demonstrate the reliability of their paper.
3) Conclusions from the paper like 'Finetuning text encoder improves the convergence speed for UNet' are not important, because finetuning usually improves the performance.
4) What's more, contributio

Taxonomy

Topics: Video Analysis and Summarization · Computer Graphics and Visualization Techniques

Methods: Diffusion