Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

Yujie Wei; Shiwei Zhang; Hangjie Yuan; Yujin Han; Zhekai Chen; Jiayu Wang; Difan Zou; Xihui Liu; Yingya Zhang; Yu Liu; Hongming Shan

arXiv:2510.24711·cs.CV·March 3, 2026

TL;DR

This paper introduces ProMoE, an MoE framework for diffusion transformers whose two-step router receives explicit routing guidance. The guidance promotes expert specialization and improves class-conditional image generation on ImageNet.

Contribution

ProMoE routes tokens in two steps, conditional routing followed by prototypical routing, and adds semantic guidance and a contrastive loss to strengthen expert specialization in vision MoE.
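The two-step process can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that each token carries a flag marking it as unconditional, that unconditional tokens take a separate path, and that conditional tokens go to the experts whose prototypes score the highest cosine similarity. The names `two_step_route`, `is_unconditional`, and `top_k`, and the choice of cosine similarity, are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def two_step_route(tokens, is_unconditional, prototypes, top_k=1):
    """Return one (kind, expert_indices) pair per token.

    Step 1 (conditional routing, assumed form): tokens flagged as
    unconditional are set aside and get no prototype-based assignment.
    Step 2 (prototypical routing, assumed form): each remaining token is
    sent to the top_k experts with the most similar prototypes.
    """
    assignments = []
    for token, uncond in zip(tokens, is_unconditional):
        if uncond:
            assignments.append(("uncond", []))
            continue
        scores = [cosine(token, proto) for proto in prototypes]
        ranked = sorted(range(len(prototypes)),
                        key=lambda e: scores[e], reverse=True)
        assignments.append(("cond", ranked[:top_k]))
    return assignments


# Toy example: three 2-d tokens, two expert prototypes.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
flags = [False, False, True]
prototypes = [[1.0, 0.1], [0.1, 1.0]]
routes = two_step_route(tokens, flags, prototypes)
```

In this toy case the first token lands on expert 0, the second on expert 1, and the third is set aside as unconditional.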

Findings

01. ProMoE outperforms state-of-the-art methods on ImageNet.
02. Explicit routing guidance improves expert specialization.
03. Prototypical routing enhances intra-expert coherence and inter-expert diversity.
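One way to read "intra-expert coherence and inter-expert diversity" is as a contrastive objective over token-prototype similarities. The sketch below uses a generic InfoNCE-style loss as a stand-in. This summary does not state the paper's actual loss, so the function name `routing_contrastive_loss`, the `temperature` parameter, and the use of cosine similarity are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def routing_contrastive_loss(tokens, assigned, prototypes, temperature=0.1):
    """Mean InfoNCE-style loss (assumed form, not the paper's definition).

    Each token is pulled toward the prototype of its assigned expert
    (coherence) and pushed away from all other prototypes (diversity).
    """
    total = 0.0
    for token, expert in zip(tokens, assigned):
        logits = [cosine(token, p) / temperature for p in prototypes]
        # Numerically stable log-sum-exp over all prototypes.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[expert]
    return total / len(tokens)


tokens = [[1.0, 0.0], [0.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = routing_contrastive_loss(tokens, [0, 1], prototypes)
loss_swapped = routing_contrastive_loss(tokens, [1, 0], prototypes)
```

With tokens aligned to their assigned prototypes the loss is close to zero; swapping the assignments makes it large, which is the behavior a specialization-promoting loss should have.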

Abstract

Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 2

Strengths

1. The rationale for separating conditional and unconditional tokens is clear and well-founded.
2. The investigation into routing guidance and load balancing is insightful and valuable.

Weaknesses

1. It would be beneficial to include ablation studies on dense models with conditional routing to determine whether the performance gain stems solely from conditional routing itself or requires combination with routing enhancements.
2. Since one key advantage of MoE models is improved computational efficiency, the authors are encouraged to report training and inference times, as well as FLOPs, in comparison to both dense models and other MoE variants.
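As a rough illustration of the FLOPs comparison requested in the second point, per-token feed-forward cost for a dense block and a top-k MoE block can be estimated as below. The estimate counts two FLOPs per multiply-accumulate and ignores attention, activations, and token dispatch overhead. The function names and the dimensions are made-up examples, not figures from the paper.

```python
def ffn_flops_per_token(d_model, d_hidden):
    """FLOPs for one two-layer feed-forward block applied to one token.

    Up projection (d_model x d_hidden) plus down projection
    (d_hidden x d_model), at 2 FLOPs per multiply-accumulate.
    """
    return 2 * d_model * d_hidden + 2 * d_hidden * d_model


def moe_flops_per_token(d_model, d_hidden, num_experts, top_k):
    """FLOPs for a top-k MoE feed-forward layer applied to one token.

    Only top_k experts run per token; the router is modeled as a single
    d_model x num_experts projection (an assumption for this sketch).
    """
    router = 2 * d_model * num_experts
    return router + top_k * ffn_flops_per_token(d_model, d_hidden)


# Hypothetical sizes chosen only for illustration.
dense = ffn_flops_per_token(1024, 4096)
moe = moe_flops_per_token(1024, 4096, num_experts=8, top_k=1)
overhead = moe - dense  # cost of the router alone when top_k == 1
```

Under these assumptions a top-1 MoE layer with eight experts holds roughly eight times the feed-forward parameters of the dense block while adding only the router's cost per token, which is why reviewers ask for measured times alongside FLOPs.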

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The paper effectively addresses the core challenges of visual token redundancy and functional heterogeneity in Diffusion Transformers, introducing mechanisms that enable true expert specialization within the Mixture-of-Experts framework.
2. The proposed method demonstrates strong and consistent scaling behavior across multiple model sizes, validating its robustness and efficiency under both Rectified Flow and DDPM training paradigms.

Weaknesses

1. The experiments are conducted solely on ImageNet-1K for class-conditional generation, without evaluations on other datasets or modalities, which limits the evidence of generalization.
2. The paper does not report quantitative expert utilization, such as the proportion of tokens or capacity per expert, making it hard to assess balance and specialization.
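The utilization statistic requested in the second point could be computed along these lines. This is a hypothetical helper, not something taken from the paper: it takes a flat list of per-token expert indices and reports the fraction of tokens each expert receives, plus a simple imbalance ratio (the largest share divided by the uniform share).

```python
from collections import Counter


def expert_utilization(assignments, num_experts):
    """Fraction of routed tokens received by each expert.

    `assignments` is a list of expert indices, one per routed token.
    """
    if not assignments:
        raise ValueError("assignments must contain at least one token")
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(expert, 0) / total for expert in range(num_experts)]


def imbalance_ratio(fractions):
    """Largest expert share relative to a perfectly uniform share.

    1.0 means perfectly balanced; len(fractions) means one expert
    receives every token.
    """
    return max(fractions) * len(fractions)


# Four tokens routed among four experts; expert 2 receives nothing.
fractions = expert_utilization([0, 0, 1, 3], num_experts=4)
ratio = imbalance_ratio(fractions)
```

Reporting both numbers per layer would show whether any expert is starved or overloaded, which is the balance question the reviewer raises.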

Reviewer 03·Rating 6·Confidence 4

Strengths

- This paper clearly diagnoses the problem of vision MoE and proposes ProMoE, an innovative method, to solve it.
- ProMoE achieves validated, state-of-the-art results on the ImageNet benchmark.
- The presentation is clear and easy to understand.

Weaknesses

- What is the fundamental difference between prototypical routing and conventional MoE routing mechanisms, such as one using a standard linear layer? The paper introduces "learnable prototypes", but this seems functionally very similar to using the learnable weights of a linear layer to calculate token-expert affinities. Could you clarify what makes this prototypical approach a genuine innovation, rather than just a conceptual re-framing of a standard linear gating mechanism?
- The routing mech

Code & Models

Models

🤗 weilllllls/ProMoE (model)
