Scaling Diffusion Transformers Efficiently via $\mu$P

Chenyu Zheng; Xinyu Zhang; Rongzhen Wang; Wei Huang; Zhi Tian; Weilin Huang; Jun Zhu; Chongxuan Li

arXiv:2505.15270 · cs.LG · November 3, 2025

TL;DR

This paper extends Maximal Update Parametrization ($\mu$P) to diffusion Transformers, enabling efficient large-scale training and hyperparameter transfer, significantly reducing tuning costs and improving performance in vision generative models.

Contribution

It generalizes $\mu$P to diffusion Transformers, providing theoretical validation and demonstrating practical benefits such as faster convergence and reduced tuning costs in large-scale models.

Findings

01

$\mu$P of diffusion Transformers such as U-ViT, DiT, PixArt-$\alpha$, and MMDiT aligns with that of the vanilla Transformer.

02

DiT-$\mu$P achieves 2.9x faster convergence.

03

Models under $\mu$P outperform baselines with minimal tuning effort.

Abstract

Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers; it enables stable HP transfer from small to large language models and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and training objective. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that the $\mu$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies.…
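
The HP transfer mentioned in the abstract can be illustrated with a minimal sketch. It assumes one commonly used formulation of $\mu$P for Adam-style optimizers on vanilla Transformers, which is not spelled out on this page: input weights and biases keep the base learning rate, hidden weight matrices have their learning rate divided by the width multiplier, and the output layer keeps the base learning rate but has its forward multiplier divided by the width multiplier. The helper name `mup_adam_settings`, its dictionary keys, and the numbers used are hypothetical and are not taken from the paper or its repository.

```python
def mup_adam_settings(base_lr, base_width, width):
    """Hypothetical helper: scale Adam settings from a small proxy model
    (base_width) to a wider target model (width) under an assumed muP
    formulation. Not the paper's code."""
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    width_mult = width / base_width  # how much wider the target model is
    return {
        # Input weights and biases: learning rate left unchanged (assumed rule).
        "input_lr": base_lr,
        # Hidden weight matrices: learning rate shrinks as 1 / width_mult.
        "hidden_lr": base_lr / width_mult,
        # Output weights: learning rate unchanged, forward multiplier shrinks.
        "output_lr": base_lr,
        "output_multiplier": 1.0 / width_mult,
    }


# Illustrative numbers only: a learning rate tuned at width 256,
# transferred to a model four times wider.
settings = mup_adam_settings(base_lr=1e-3, base_width=256, width=1024)
print(settings)
```

Under these assumptions, only the small proxy model needs a learning-rate search; the wider model reuses the tuned base value with the per-group scaling above.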

Code & Models

Repositories

ML-GSAI/Scaling-Diffusion-Transformers-muP
PyTorch · Official

Models

GSAI-ML/DiT-muP
Hugging Face model

Taxonomy

Topics: Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices · Neural Networks and Applications

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Diffusion