Scaling Diffusion Transformers to 16 Billion Parameters

Zhengcong Fei; Mingyuan Fan; Changqian Yu; Debang Li; Junshi Huang

arXiv:2407.11633 · cs.CV · September 10, 2024

TL;DR

This paper introduces DiT-MoE, a sparse diffusion Transformer that scales to 16.5 billion parameters. It reaches state-of-the-art image generation quality at lower computational cost than dense networks by using shared expert routing and an expert-level balance loss, and it analyzes how the routed experts specialize.

Contribution

The paper presents DiT-MoE, a scalable sparse diffusion Transformer built on two simple designs, shared expert routing and an expert-level balance loss, which together enable high-quality image generation at large scale.
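To make the idea of shared expert routing concrete, here is a minimal pure-Python sketch of a single mixture-of-experts forward pass for one token. It is an illustration under stated assumptions, not the authors' implementation: the names `moe_forward` and `make_linear`, the softmax top-k gate, and the use of a single linear map as each "expert" are all hypothetical choices made for brevity.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear(dim, rng):
    """Hypothetical stand-in for an expert FFN: a single dim x dim linear map."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def moe_forward(token, router_w, routed_experts, shared_experts, top_k):
    # Router: one logit per routed expert, converted to gate probabilities.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_w]
    probs = softmax(logits) if logits else []
    # Keep only the top_k routed experts for this token (sparse activation).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]
    out = [0.0] * len(token)
    for i in chosen:
        y = routed_experts[i](token)
        out = [o + probs[i] * yi for o, yi in zip(out, y)]
    # Shared experts (assumed ungated here) process every token,
    # bypassing the router entirely.
    for expert in shared_experts:
        y = expert(token)
        out = [o + yi for o, yi in zip(out, y)]
    return out, chosen


rng = random.Random(0)
dim, n_routed, n_shared, top_k = 4, 8, 2, 2
router_w = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_routed)]
routed = [make_linear(dim, rng) for _ in range(n_routed)]
shared = [make_linear(dim, rng) for _ in range(n_shared)]
token = [rng.gauss(0.0, 1.0) for _ in range(dim)]
out, chosen = moe_forward(token, router_w, routed, shared, top_k)
```

In this sketch only `top_k` of the routed experts run per token, while the shared experts always run, which is one plausible way to let shared experts hold common knowledge and leave the routed experts free to specialize.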

Findings

01

DiT-MoE matches dense network performance with less computation.

02

Expert selection varies with spatial position and denoising steps.

03

Scaling to 16.5B parameters achieves new SOTA FID scores.

Abstract

In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while exhibiting highly optimized inference. DiT-MoE includes two simple designs: shared expert routing and an expert-level balance loss, thereby capturing common knowledge and reducing redundancy among the different routed experts. When applied to conditional image generation, a deep analysis of expert specialization yields some interesting observations: (i) expert selection depends on spatial position and denoising time step, while it is insensitive to class-conditional information; (ii) as the MoE layers go deeper, the selection of experts gradually shifts from specific spatial positions to dispersion and balance; (iii) expert specialization tends to be more concentrated at the early time steps and then gradually becomes uniform after the halfway point. We…
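The abstract does not spell out the expert-level balance loss. As a rough illustration, the sketch below uses a commonly used auxiliary load-balancing formulation for top-k MoE routers: for each expert, multiply the (normalized) fraction of routing slots it receives by its mean gate probability, and sum over experts. Whether DiT-MoE uses exactly this form is an assumption; the function name `expert_balance_loss` and the weight `alpha` are hypothetical.

```python
def expert_balance_loss(gate_probs, top_k, alpha=0.01):
    """gate_probs: one list of gate probabilities (over experts) per token."""
    n_tokens = len(gate_probs)
    n_experts = len(gate_probs[0])
    counts = [0] * n_experts        # how often each expert is selected
    mean_prob = [0.0] * n_experts   # average gate probability per expert
    for probs in gate_probs:
        order = sorted(range(n_experts), key=lambda i: probs[i], reverse=True)
        for i in order[:top_k]:
            counts[i] += 1
        for i, p in enumerate(probs):
            mean_prob[i] += p / n_tokens
    # f_i is scaled so that perfectly uniform routing gives f_i == 1.
    f = [n_experts * c / (top_k * n_tokens) for c in counts]
    return alpha * sum(fi * pi for fi, pi in zip(f, mean_prob))


# Four tokens, four experts, top-1 routing.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

loss_balanced = expert_balance_loss(balanced, top_k=1)
loss_collapsed = expert_balance_loss(collapsed, top_k=1)
```

Under this formulation, routing every token to the same expert yields a larger loss than spreading tokens evenly, so minimizing it nudges the router toward balanced expert usage.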

Code & Models

Repositories

feizc/dit-moe (PyTorch, official)

Taxonomy

Methods: Attention Is All You Need · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Diffusion · Mixture of Experts · Adam · Dropout