DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

Zhichen Zeng; Chi-Chih Chang; Jiayi Wang; Zezhou Wang; Ningxin Zheng; Zheng Zhong; Cesar A. Stuardo; Dongyang Wang; Mohamed S. Abdelfattah; Haibin Lin; Banghua Zhu; Ang Li; Ziheng Jiang

arXiv:2605.11005·cs.LG·May 13, 2026


TL;DR

DisagMoE is an MoE training system that places attention and FFN layers on disjoint GPU groups and jointly optimizes model placement and scheduling, with a reported speedup of up to 1.8x on 16-node clusters.

Contribution

It introduces a disaggregated training approach with multi-stage pipeline and bandwidth balancing, addressing communication bottlenecks in MoE training.

Findings

01. Achieves up to 1.8x speedup on 16-node clusters.
02. Effectively balances GPU and network bandwidth.
03. Improves training efficiency for large MoE models.

Abstract

Mixture-of-experts (MoE) architectures enable trillion-parameter LLMs with sparsely activated experts. Expert parallelism (EP) is a widely adopted MoE training strategy, but it suffers from severe all-to-all communication bottlenecks, which are exacerbated by limited inter-node network bandwidth as growing model sizes require distributing experts across GPU nodes. Prior work has focused on overlapping these all-to-all communications with feed-forward network (FFN) and self-attention computations, which often leaves residual network-bound stalls because of the inherent imbalance between the computation-communication ratios of attention and FFN layers. We present DisagMoE, a disaggregated MoE training system that jointly optimizes model placement and scheduling for maximal efficiency. DisagMoE separates attention and FFN layers into disjoint GPU groups, introduces a multi-stage pipeline with…
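The residual-stall argument in the abstract can be made concrete with a toy timing model. The sketch below is illustrative only and is not the paper's cost model: the function names, the per-layer millisecond figures, and the assumption of perfect overlap within a layer are all hypothetical. It shows that when each layer can hide communication only behind its own computation, a layer whose communication exceeds its computation still stalls, even if total computation and total communication are equal.

```python
# Toy model (hypothetical numbers, not measurements from the paper).
# Assumption: within one layer, communication overlaps perfectly with that
# layer's own computation, so the layer takes max(compute, comm).


def exposed_comm_ms(compute_ms: float, comm_ms: float) -> float:
    """Communication time that cannot be hidden behind the layer's compute."""
    return max(0.0, comm_ms - compute_ms)


def overlapped_layer_ms(compute_ms: float, comm_ms: float) -> float:
    """Wall-clock time of one layer under perfect intra-layer overlap."""
    return max(compute_ms, comm_ms)


# Hypothetical per-layer costs in milliseconds: one compute-heavy layer
# and one communication-heavy layer.
layers = {
    "attention": {"compute_ms": 4.0, "comm_ms": 1.0},
    "ffn": {"compute_ms": 2.0, "comm_ms": 5.0},
}

# Stall per layer when overlap is restricted to the layer itself.
stalls = {name: exposed_comm_ms(**cost) for name, cost in layers.items()}

# Total time with per-layer overlap only.
per_layer_total = sum(overlapped_layer_ms(**cost) for cost in layers.values())

# Lower bound if all compute could overlap with all communication.
total_compute = sum(cost["compute_ms"] for cost in layers.values())
total_comm = sum(cost["comm_ms"] for cost in layers.values())
pooled_bound = max(total_compute, total_comm)

print(stalls)
print(per_layer_total, pooled_bound)
```

In this toy setting, the gap between `per_layer_total` and `pooled_bound` is the stall that per-layer overlap leaves behind. How DisagMoE closes that gap is described in the paper itself and is not reproduced here.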
