Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers: Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock
Haiying Sha

TL;DR
This paper diagnoses failure modes of sparse Mixture-of-Experts (MoE) routing in video Diffusion Transformers, proposes solutions for routing collapse, and outlines a developmental roadmap from visual models to world understanding.
Contribution
It introduces a systematic diagnosis of routing failures, proposes the Functional Redundancy Hypothesis, and provides engineering solutions and a developmental roadmap for sparse MoE models.
Findings
Identified five failure modes in sparse MoE routing.
Proposed the Functional Redundancy Hypothesis to explain deadlock.
Provided engineering solutions for the bfloat16 precision issue.
Abstract
This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights, shared experts are initialized to zero for verification and then to extremely small non-zero noise for actual training, while only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one-third of layers degenerate into a single-expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross-attention routers exhibit preliminary self-recovery, yet…
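The three initialization rules described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation: the names `upcycle`, `moe_forward`, and `expert_load`, the GELU activation, the tiny layer sizes, and the choice to renormalize the softmax over the selected top-k experts are all assumptions made for illustration. Under those assumptions, cloning the dense FFN into every routed expert and zeroing the shared expert leaves the layer output unchanged, which is one way to read the "initialized to zero for verification" step.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, w1, w2):
    # two-layer feed-forward block without biases
    return gelu(x @ w1) @ w2

def upcycle(w1, w2, num_experts, rng, shared_std=0.0, router_std=0.02):
    """Hypothetical dense-to-MoE conversion following the three rules:
    routed experts clone the dense FFN, the shared expert starts at zero
    (shared_std=0) or tiny noise, and only the router is random."""
    d_model = w1.shape[0]
    return {
        "routed": [(w1.copy(), w2.copy()) for _ in range(num_experts)],
        "shared": (shared_std * rng.standard_normal(w1.shape),
                   shared_std * rng.standard_normal(w2.shape)),
        "router": router_std * rng.standard_normal((d_model, num_experts)),
    }

def moe_forward(x, params, top_k=2):
    """Token-choice routing with a linear router: each token picks its
    top_k experts; gate weights are a softmax over the selected logits."""
    logits = x @ params["router"]                        # (tokens, experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]    # (tokens, top_k)
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates = gates / gates.sum(axis=-1, keepdims=True)    # sums to 1 per token

    out = np.zeros((x.shape[0], params["routed"][0][1].shape[1]))
    for e, (e_w1, e_w2) in enumerate(params["routed"]):
        mask = top_idx == e                              # where expert e was chosen
        chosen = mask.any(axis=-1)
        if chosen.any():
            weight = (gates * mask).sum(axis=-1)[chosen]
            out[chosen] += weight[:, None] * ffn(x[chosen], e_w1, e_w2)
    out += ffn(x, *params["shared"])                     # shared expert sees every token
    return out, top_idx

def expert_load(top_idx, num_experts):
    # fraction of routing slots assigned to each expert; a layer stuck in
    # single-expert mode would show one entry near 1 / top_k and zeros elsewhere
    counts = np.bincount(top_idx.ravel(), minlength=num_experts)
    return counts / top_idx.size

# Illustrative check with toy sizes (not the paper's 5B-parameter setting)
rng = np.random.default_rng(0)
dense_w1 = rng.standard_normal((8, 16)) * 0.1
dense_w2 = rng.standard_normal((16, 8)) * 0.1
tokens = rng.standard_normal((32, 8))

moe = upcycle(dense_w1, dense_w2, num_experts=4, rng=rng)
moe_out, picks = moe_forward(tokens, moe, top_k=2)
print("matches dense FFN at init:", np.allclose(moe_out, ffn(tokens, dense_w1, dense_w2)))
print("expert load:", expert_load(picks, 4))
```

Because the gate weights sum to one and every routed expert is an identical clone, the equivalence holds regardless of which experts the random router selects. Passing a small non-zero `shared_std` mirrors the second stage the abstract mentions, where the shared expert starts from very small noise for actual training; the output then deviates only slightly from the dense model.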
