Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory
Bole Ma, Jan Eitzinger, Harald Koestler, Gerhard Wellein

TL;DR
This paper tests two assumptions behind dispatch-overhead mitigation techniques for MoE models: that routing imbalance is correctable by the system layer, and that mock-token benchmarks faithfully represent production routing. Both assumptions fail.
Contribution
It introduces DODOCO, a comprehensive testing framework that exposes the limitations of current assumptions and benchmarks in modeling routing imbalance in MoE dispatch operations.
Findings
EP scaling has minimal impact on routing imbalance: across the measured scan from 4 to 32 ranks, the per-expert max/mean token ratio changes by at most 5%.
Mock tokens significantly overestimate routing Gini and misrepresent batch-size effects.
Architectures cluster into two stable routing imbalance bands, influencing interconnect design.
Abstract
AlltoAll dispatch is the dominant bottleneck of MoE expert parallelism, and the interconnect community has responded with four families of mitigations: predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology. All four rest on two assumptions about the workload. The first is that routing imbalance is correctable by the system layer. The second is that the mock-token benchmarks evaluating them faithfully represent production routing. We introduce DODOCO to test both assumptions. We instrument five MoE checkpoints spanning five sequence-mixer designs (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN) under a 5 by 6 grid of data conditions plus a matched EP scan from 4 to 32 ranks on H100s; both assumptions fail. Scaling EP changes the per-expert max/mean token ratio by at most 5% within every…
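The two imbalance metrics named above, the per-expert max/mean token ratio and the routing Gini, can be illustrated with a minimal sketch. It assumes the standard Gini coefficient over per-expert token counts; the source does not state the exact definition DODOCO uses, and the function names and the flat list of expert ids are hypothetical, chosen only for illustration.

```python
def expert_token_counts(assignments, num_experts):
    """Count tokens routed to each expert.

    `assignments` is a flat list of expert ids, one entry per (token, slot)
    pair after top-k routing. This input layout is an assumption, not
    DODOCO's actual interface.
    """
    counts = [0] * num_experts
    for expert_id in assignments:
        counts[expert_id] += 1
    return counts


def max_mean_ratio(counts):
    """Load of the busiest expert relative to the average expert load."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean > 0 else float("nan")


def gini(counts):
    """Standard Gini coefficient of per-expert loads (assumed definition).

    0.0 means perfectly balanced routing; (n - 1) / n means every token
    goes to a single expert out of n.
    """
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0
    ascending = sorted(counts)
    # G = 2 * sum_i(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n
    weighted = sum(i * x for i, x in enumerate(ascending, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Toy example: 8 routed slots over 4 experts, skewed towards expert 0.
counts = expert_token_counts([0, 0, 0, 0, 0, 1, 2, 3], num_experts=4)
print(counts, max_mean_ratio(counts), gini(counts))
```

Comparing these two numbers for mock tokens against real text at the same batch size is one way to check, on a small scale, whether a benchmark input distorts routing imbalance.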
