Scaling Diffusion Transformers to 16 Billion Parameters
Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang

TL;DR
This paper introduces DiT-MoE, a sparse diffusion Transformer that scales to 16.5 billion parameters. It reports state-of-the-art image generation quality at reduced computational cost through expert routing, and it analyzes how the experts specialize.
Contribution
The paper presents DiT-MoE, a scalable sparse diffusion Transformer that uses shared expert routing and an expert-level balance loss to enable high-performance image generation at large scale.
Findings
DiT-MoE matches dense network performance with less computation.
Expert selection varies with spatial position and denoising steps.
Scaling to 16.5B parameters achieves new SOTA FID scores.
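The second finding concerns how often each expert is chosen for a given spatial position or denoising step. A minimal sketch of that kind of tally is shown here, assuming routing decisions have already been logged as integer arrays. The function name selection_frequency, the argument names, and the synthetic data are all hypothetical and are not taken from the paper's code.

```python
import numpy as np

def selection_frequency(expert_ids, group_ids, num_experts, num_groups):
    """Fraction of routing decisions in each group that went to each expert.

    expert_ids: (N,) index of the expert chosen for each routing decision.
    group_ids:  (N,) bucket of each decision, e.g. a patch position or a
                denoising-step bucket (the bucketing scheme is an assumption).
    """
    counts = np.zeros((num_groups, num_experts), dtype=np.int64)
    # Unbuffered accumulation so repeated (group, expert) pairs are all counted.
    np.add.at(counts, (group_ids, expert_ids), 1)
    totals = counts.sum(axis=1, keepdims=True)
    # Guard against empty groups to avoid division by zero.
    return counts / np.maximum(totals, 1)

# Synthetic routing log: group 0 mostly uses expert 0, group 1 only expert 2.
expert_ids = np.array([0, 0, 1, 2, 2, 2])
group_ids = np.array([0, 0, 0, 1, 1, 1])
freq = selection_frequency(expert_ids, group_ids, num_experts=3, num_groups=2)
```

Each row of freq sums to 1 for a non-empty group, so rows can be compared directly across positions or time steps. A row close to uniform suggests balanced use of the experts, while a peaked row suggests specialization.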
Abstract
In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while exhibiting highly optimized inference. DiT-MoE includes two simple designs: shared expert routing and an expert-level balance loss, which capture common knowledge and reduce redundancy among the different routed experts. When applied to conditional image generation, a deep analysis of expert specialization yields some interesting observations: (i) expert selection shows a preference related to spatial position and denoising time step, while being insensitive to different class-conditional information; (ii) as the MoE layers go deeper, the selection of experts gradually shifts from specific spatial positions to dispersion and balance; (iii) expert specialization tends to be more concentrated at early time steps and then gradually becomes uniform after the halfway point. We…
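The two designs named in the abstract, shared expert routing and an expert-level balance loss, can be illustrated with a small NumPy sketch. This is a sketch under stated assumptions, not the authors' implementation. The ReLU feed-forward experts, the softmax top-k router, and the balance loss written as the sum over experts of (scaled selection fraction × mean router probability) follow a form commonly used in the MoE literature. The exact formulation in DiT-MoE is not given in this text, and all names below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mlp(x, w1, w2):
    # Two-layer feed-forward expert; the ReLU activation is an assumption.
    return np.maximum(x @ w1, 0.0) @ w2

def moe_forward(tokens, router_w, routed, shared, top_k=2):
    """tokens: (T, D); router_w: (D, E); routed, shared: lists of (w1, w2)."""
    T = tokens.shape[0]
    E = len(routed)
    probs = softmax(tokens @ router_w)                 # (T, E) router scores
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]   # (T, k) chosen experts

    out = np.zeros_like(tokens)
    # Shared experts process every token, capturing common knowledge.
    for w1, w2 in shared:
        out += mlp(tokens, w1, w2)
    # Each routed expert processes only the tokens that selected it.
    for e, (w1, w2) in enumerate(routed):
        mask = (top_idx == e).any(axis=-1)
        if mask.any():
            out[mask] += probs[mask, e:e + 1] * mlp(tokens[mask], w1, w2)

    # Balance loss (assumed form): penalizes experts that are both chosen
    # often (f) and given high average router probability (p).
    counts = np.bincount(top_idx.ravel(), minlength=E)
    f = counts * E / (top_k * T)
    p = probs.mean(axis=0)
    balance_loss = float((f * p).sum())
    return out, balance_loss, top_idx

# Toy usage with random weights (all sizes are illustrative).
rng = np.random.default_rng(0)
T, D, H, E = 16, 8, 32, 4
make = lambda: (rng.normal(size=(D, H)) * 0.1, rng.normal(size=(H, D)) * 0.1)
tokens = rng.normal(size=(T, D))
out, loss, idx = moe_forward(
    tokens, rng.normal(size=(D, E)), [make() for _ in range(E)], [make()], top_k=2
)
```

Only top_k of the E routed experts run for each token, which is why a sparse model can hold many more parameters than it activates per token. In training, the balance term would typically be added to the diffusion loss with a small weight.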
Taxonomy
Methods: Attention Is All You Need · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Diffusion · Mixture of Experts · Adam · Dropout
