Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe

Yahui Liu; Yang Yue; Jingyuan Zhang; Chenxi Sun; Yang Zhou; Wencong Zeng; Ruiming Tang; Guorui Zhou

arXiv:2512.01252·cs.LG·December 2, 2025

TL;DR

This paper explores the architectural configuration space of Diffusion Mixture-of-Experts models and shows that carefully tuning design factors often improves performance and efficiency more than routing innovations alone.

Contribution

The paper systematically studies architectural factors in Diffusion MoE models and provides a practical training recipe that improves performance with fewer parameters.

Findings

1. Careful architecture tuning yields significant performance gains.
2. Proposed architectures outperform strong baselines.
3. Efficient training recipes enable effective Diffusion MoE models.

Abstract

Recent efforts on Diffusion Mixture-of-Experts (MoE) models have primarily focused on developing more sophisticated routing mechanisms. However, we observe that the underlying architectural configuration space remains markedly under-explored. Inspired by the MoE design paradigms established in large language models (LLMs), we identify a set of crucial architectural factors for building effective Diffusion MoE models--including DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. Our systematic study reveals that carefully tuning these configurations is essential for unlocking the full potential of Diffusion MoE models, often yielding gains that exceed those achieved by routing innovations alone. Through extensive experiments, we present novel architectures that can be efficiently applied to both latent and…
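To make the design factors named in the abstract concrete, the following is a minimal NumPy sketch of an MoE feed-forward layer with always-active shared experts plus top-k routed experts, which is the general pattern usually meant by "DeepSeek-style" expert modules. It is an illustration under stated assumptions, not the paper's architecture: the class name `MoEFFN`, its parameters (`hidden` for the intermediate width, `n_routed`, `n_shared`, `top_k`), the GELU activation, and the softmax router are all hypothetical choices made for this example.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU (an assumed activation, chosen for illustration)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


class MoEFFN:
    """Hypothetical MoE feed-forward layer: shared experts + top-k routed experts."""

    def __init__(self, dim, hidden, n_routed, n_shared, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.dim, self.hidden = dim, hidden
        self.n_routed, self.n_shared, self.top_k = n_routed, n_shared, top_k
        n_total = n_shared + n_routed
        # Experts [0, n_shared) are shared; the remaining ones are routed.
        self.w_in = rng.normal(0.0, dim ** -0.5, (n_total, dim, hidden))
        self.w_out = rng.normal(0.0, hidden ** -0.5, (n_total, hidden, dim))
        self.router = rng.normal(0.0, dim ** -0.5, (dim, n_routed))

    def expert(self, i, x):
        # Two-layer MLP with intermediate width `hidden`.
        return gelu(x @ self.w_in[i]) @ self.w_out[i]

    def __call__(self, x):
        # x has shape (tokens, dim).
        out = np.zeros_like(x)
        for i in range(self.n_shared):  # shared experts process every token
            out += self.expert(i, x)
        logits = x @ self.router
        logits = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=-1, keepdims=True)
        top = np.argsort(-probs, axis=-1)[:, : self.top_k]  # top-k expert ids per token
        for e in range(self.n_routed):
            mask = (top == e).any(axis=-1)  # tokens routed to expert e
            if mask.any():
                gate = probs[mask, e][:, None]
                out[mask] += gate * self.expert(self.n_shared + e, x[mask])
        return out

    def param_counts(self):
        # Returns (total parameters, parameters activated per token).
        per_expert = 2 * self.dim * self.hidden
        router = self.dim * self.n_routed
        total = (self.n_shared + self.n_routed) * per_expert + router
        active = (self.n_shared + self.top_k) * per_expert + router
        return total, active


layer = MoEFFN(dim=8, hidden=4, n_routed=4, n_shared=1, top_k=2)
tokens = np.random.default_rng(1).normal(size=(5, 8))
output = layer(tokens)
total_params, active_params = layer.param_counts()
```

In this sketch, the intermediate width (`hidden`) and the expert count (`n_routed`) are the knobs that trade total parameters against per-token activated parameters, which is the kind of configuration space the abstract describes as under-explored.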

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications