Scaling Diffusion Transformers Efficiently via $\mu$P
Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li

TL;DR
This paper extends Maximal Update Parametrization ($\mu$P) to diffusion Transformers, enabling efficient large-scale training and hyperparameter transfer, significantly reducing tuning costs and improving performance in vision generative models.
Contribution
It generalizes $\mu$P to diffusion Transformers, providing theoretical validation and demonstrating practical benefits such as faster convergence and reduced tuning costs in large-scale models.
Findings
The $\mu$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer.
DiT-$\mu$P achieves 2.9x faster convergence.
Models under $\mu$P outperform baselines with minimal tuning effort.
Abstract
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies.…
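The truncated abstract does not spell out the scaling rules themselves. As a rough illustration of what HP transfer across width means in practice, the sketch below assumes one commonly cited $\mu$P rule for Adam on hidden weight matrices: the learning rate scales as 1/width and the initialization variance scales as 1/width. The names `ProxyConfig` and `transfer_hidden_hps` are hypothetical, the numbers are illustrative only, and this is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class ProxyConfig:
    """Hyperparameters tuned on a small proxy model (hypothetical container)."""
    width: int        # hidden width of the proxy model
    hidden_lr: float  # Adam learning rate tuned for hidden weight matrices
    init_std: float   # init standard deviation of hidden weights at proxy width


def transfer_hidden_hps(proxy: ProxyConfig, target_width: int) -> dict:
    """Rescale hidden-weight HPs from the proxy width to a target width.

    Assumed rule (hidden weights under Adam only, not taken from the source):
      learning rate ~ 1 / width
      init variance ~ 1 / width, so init std ~ 1 / sqrt(width)
    Input and output layers follow different rules and are not covered here.
    """
    ratio = target_width / proxy.width
    return {
        "hidden_lr": proxy.hidden_lr / ratio,
        "init_std": proxy.init_std / ratio ** 0.5,
    }


# Illustrative values: tune at width 256, then reuse at width 1024.
proxy = ProxyConfig(width=256, hidden_lr=1e-3, init_std=0.02)
target = transfer_hidden_hps(proxy, target_width=1024)
print(target)
```

Under this assumed rule, the HPs found on the small proxy are rescaled for the wider model instead of being searched again, which is where the tuning-cost savings described above would come from.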
Taxonomy
Topics: Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices · Neural Networks and Applications
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Diffusion
