Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers: Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock

Haiying Sha

arXiv:2605.19378 · cs.CV · May 20, 2026

TL;DR

This paper diagnoses failure modes of sparse Mixture-of-Experts (MoE) routing in video Diffusion Transformers, proposes solutions for routing collapse, and outlines a developmental roadmap from visual models to world understanding.

Contribution

It introduces a systematic diagnosis of routing failures, proposes the Functional Redundancy Hypothesis, and provides engineering solutions and a developmental roadmap for sparse MoE models.

Findings

01

Identified five failure modes in sparse MoE routing.

02

Proposed the Functional Redundancy Hypothesis to explain deadlock.

03

Provided engineering solutions for the bfloat16 precision issue.

Abstract

This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights; shared experts are initialized to zero for verification and then to extremely small non-zero noise for actual training; and only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one-third of layers degenerate into a single-expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross-attention routers exhibit preliminary self-recovery, yet…
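The three initialization laws in the abstract can be illustrated with a small sketch. This is not the paper's code: the tiny dimensions, the ReLU feed-forward block, the renormalized top-k softmax gate, and all names (`ffn`, `top_k_gate`, `moe_forward`) are assumptions chosen for illustration. Under those assumptions, cloning the dense FFN into every routed expert and zeroing the shared expert should make the converted layer reproduce the dense output whatever the randomly initialized router selects, which is one plausible reading of "initialized to zero for verification".

```python
import math
import random


def ffn(x, w1, w2):
    """Tiny two-layer FFN with ReLU (an assumed stand-in for the real block)."""
    hidden = [max(0.0, sum(xi * w1[i][j] for i, xi in enumerate(x)))
              for j in range(len(w1[0]))]
    return [sum(h * w2[j][k] for j, h in enumerate(hidden))
            for k in range(len(w2[0]))]


def rand_matrix(rows, cols, rng, scale=1.0):
    """Gaussian random matrix stored as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def top_k_gate(x, w_gate, k):
    """Token-choice gate: pick the k largest logits, softmax over those only.

    Renormalizing over the selected experts (weights sum to 1) is an assumption.
    """
    logits = [sum(xi * w_gate[i][e] for i, xi in enumerate(x))
              for e in range(len(w_gate[0]))]
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in chosen)
    exps = {e: math.exp(logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


def moe_forward(x, experts, shared, w_gate, k):
    """Shared-expert output plus the gate-weighted sum of routed experts."""
    out = ffn(x, *shared)
    for e, weight in top_k_gate(x, w_gate, k).items():
        y = ffn(x, *experts[e])
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


rng = random.Random(0)
d_model, d_hidden, n_experts, k = 4, 8, 4, 2

# Stand-in for the pretrained dense FFN.
dense = (rand_matrix(d_model, d_hidden, rng), rand_matrix(d_hidden, d_model, rng))

# Law 1: every routed expert is an exact copy of the dense FFN weights.
experts = [([row[:] for row in dense[0]], [row[:] for row in dense[1]])
           for _ in range(n_experts)]

# Law 2 (verification phase): the shared expert starts at exactly zero.
shared = ([[0.0] * d_hidden for _ in range(d_model)],
          [[0.0] * d_model for _ in range(d_hidden)])

# Law 3: only the gate is randomly initialized.
w_gate = rand_matrix(d_model, n_experts, rng, scale=0.02)

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
dense_out = ffn(x, *dense)
moe_out = moe_forward(x, experts, shared, w_gate, k)
max_err = max(abs(a - b) for a, b in zip(dense_out, moe_out))
print(f"max |dense - moe| = {max_err:.3e}")
```

In this sketch the gate weights cancel out because every selected expert returns the same vector, so any mismatch at initialization would point to a conversion bug rather than a routing effect. Replacing the zero shared expert with small non-zero noise, as the abstract describes for actual training, would break the exact match by a correspondingly small amount.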
