EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

Haotian Sun; Tao Lei; Bowen Zhang; Yanghao Li; Haoshuo Huang; Ruoming Pang; Bo Dai; Nan Du

arXiv:2410.02098 · cs.CV · March 5, 2025

Open Access

TL;DR

EC-DIT introduces an adaptive Mixture-of-Experts approach for diffusion transformers, enabling efficient scaling to 97 billion parameters and improving text-to-image synthesis quality through heterogeneous compute allocation.

Contribution

The paper presents EC-DIT, a novel adaptive expert-choice routing method for diffusion transformers that enhances scalability and performance in text-to-image synthesis.

Findings

01. Achieved a state-of-the-art GenEval score of 71.68%.

02. Enabled scaling of models up to 97 billion parameters.

03. Demonstrated improved training convergence and image quality.

Abstract

Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understand the input texts and generate the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional…
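The routing idea named in the abstract can be illustrated with a minimal sketch of generic expert-choice routing, in which each expert selects its own top-scoring tokens instead of each token selecting experts. This is not the paper's implementation: the function names `expert_choice_route` and `experts_per_token`, the `capacity` parameter, and the choice to normalize affinities with a softmax over experts are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def expert_choice_route(logits, capacity):
    """Hypothetical helper: generic expert-choice routing.

    logits[t][e] is the affinity of token t to expert e.
    capacity is the number of tokens each expert selects (assumed fixed).
    Returns one list per expert of (token_index, gate_weight) pairs.
    """
    num_tokens = len(logits)
    num_experts = len(logits[0])
    # Assumption: normalize each token's affinities across experts.
    probs = [softmax(row) for row in logits]
    assignments = []
    for e in range(num_experts):
        # Each expert ranks all tokens by affinity and keeps its top `capacity`.
        ranked = sorted(range(num_tokens), key=lambda t: probs[t][e], reverse=True)
        assignments.append([(t, probs[t][e]) for t in ranked[:capacity]])
    return assignments


def experts_per_token(assignments, num_tokens):
    """Count how many experts selected each token."""
    counts = [0] * num_tokens
    for chosen in assignments:
        for t, _ in chosen:
            counts[t] += 1
    return counts


if __name__ == "__main__":
    random.seed(0)
    num_tokens, num_experts, capacity = 8, 4, 4
    # Random affinities stand in for a learned router's output.
    logits = [[random.gauss(0.0, 1.0) for _ in range(num_experts)]
              for _ in range(num_tokens)]
    routed = expert_choice_route(logits, capacity)
    print(experts_per_token(routed, num_tokens))
```

Because every expert fills exactly `capacity` slots, the load across experts is balanced by construction, while the number of experts applied to a given token can vary. In this sketch, that variation is what stands in for the heterogeneous compute allocation the abstract describes.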

Taxonomy

Topics: Opinion Dynamics and Social Influence

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Mixture of Experts · Diffusion