Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

Xun Sun; Shaoyuan Chen; Pingchuan Ma; Yue Chen; Ziwei Yuan; Zhanhao Cao; Han Han; Shangming Cai; Teng Ma; Xuchun Shang; Xinpeng Zhao; Ke Yang; Junlin Wei; Lianzhi Lin; Yuji Liu; Feng Ren; Haoran Hu; Cheng Wan; Yingdi Shan; Yongwei Wu; Mingxing Zhang

arXiv:2605.10670 · cs.DC · May 12, 2026

TL;DR

This paper introduces EEP, a system for live recovery from partial failures in wide expert-parallel MoE inference. It keeps throughput high and reduces downtime when a rank fails.

Contribution

EEP provides a mutable runtime membership model that repairs and reintegrates failed ranks without full system rebuilds, improving fault tolerance in MoE serving.
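To make the idea of mutable membership concrete, here is a toy sketch. It is not EEP's API or its placement policy: the `Membership` class, the epoch counter, and the round-robin `place_experts` function are all hypothetical names invented for illustration. The sketch only shows the bookkeeping difference between a rank set fixed at initialization and one that can shrink and grow while the instance stays valid.

```python
from dataclasses import dataclass, field


@dataclass
class Membership:
    """Toy model (hypothetical): the set of live EP ranks plus a change counter."""
    world_size: int
    active: set = field(default_factory=set)
    epoch: int = 0  # bumped on every membership change

    def __post_init__(self):
        if not self.active:
            self.active = set(range(self.world_size))

    def fail(self, rank: int) -> None:
        """Drop a failed rank from the live set."""
        self.active.discard(rank)
        self.epoch += 1

    def rejoin(self, rank: int) -> None:
        """Reintegrate a repaired rank."""
        self.active.add(rank)
        self.epoch += 1


def place_experts(num_experts: int, membership: Membership) -> dict:
    """Round-robin experts over the live ranks (an illustrative policy only)."""
    ranks = sorted(membership.active)
    if not ranks:
        raise ValueError("no active ranks left")
    return {expert: ranks[expert % len(ranks)] for expert in range(num_experts)}


m = Membership(world_size=4)
initial = place_experts(8, m)    # experts spread over ranks 0..3
m.fail(2)
degraded = place_experts(8, m)   # experts spread over ranks 0, 1, 3 only
m.rejoin(2)
restored = place_experts(8, m)   # same mapping as `initial`
```

A real system has far more to repair than this mapping. As the abstract notes, the rank set also determines communicator state and the routing metadata baked into CUDA execution graphs, and the toy ignores both, along with moving expert weights.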

Findings

01. Under static serving, EEP stays within 4.4% of baseline throughput.

02. After a rank failure, EEP cuts recovery time from 348 s to 52 s.

03. EEP turns full downtime into a bounded, manageable interruption.
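As a quick check on the scale of the second finding, the snippet below derives the speedup and the percentage reduction from the two reported recovery times. The variable names are illustrative; only the 348 s and 52 s figures come from the findings.

```python
# Recovery times reported in the findings (seconds).
baseline_recovery_s = 348
eep_recovery_s = 52

speedup = baseline_recovery_s / eep_recovery_s
reduction_pct = 100 * (baseline_recovery_s - eep_recovery_s) / baseline_recovery_s

print(f"speedup: {speedup:.1f}x")          # speedup: 6.7x
print(f"reduction: {reduction_pct:.1f}%")  # reduction: 85.1%
```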

Abstract

Mixture-of-Experts (MoE) serving relies on wide expert parallelism (EP) to aggregate the memory capacity and bandwidth of many GPUs within one inference instance. This efficiency comes with a systems cost: every decoding step depends on token dispatch and combination across all active EP ranks, so even one rank failure can disrupt the entire service. Existing EP stacks handle such failures poorly because they treat membership as a fixed configuration established at initialization. The same rank set determines communicator state, expert placement, and the routing metadata baked into CUDA execution graphs, leaving the system with no way to shrink around a failure while keeping the instance valid. This paper argues that partial-failure tolerance should instead be formulated as a live EP validity problem. We present EEP, a communication and runtime substrate that represents membership as…
