Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
Qijun Zhang, Chen Zhang, Zhuoshan Zhou, Haibo Wang, Zhe Zhou, Zhipeng Tu, Guangyu Sun, Zhiyao Xie, Yijia Diao, Zhigang Ji, Jingwen Leng, Guanghui He, Minyi Guo

TL;DR
This paper introduces DySHARP, a dynamic in-switch computing solution for MoE models on multi-GPU systems, reducing redundant communication and achieving up to 1.79× speedup.
Contribution
DySHARP extends NVLink SHARP with dynamic communication primitives and token-centric kernel fusion to optimize MoE expert parallelism on multi-GPU systems.
Findings
DySHARP reduces redundant inter-GPU traffic in MoE.
DySHARP achieves up to 1.79× speedup over the state of the art.
Dynamic communication primitives improve MoE performance.
Abstract
Mixture-of-Experts (MoE) has been adopted by many leading large models to reduce computational requirements. However, frequent inter-GPU communication in MoE expert parallelism (EP) becomes a performance challenge. We observe substantial redundant inter-GPU data transfers in MoE that could potentially be addressed by in-switch computing. Unfortunately, the existing solution, NVLink SHARP (NVLS), supports only static collectives with regular patterns and cannot handle the dynamic communication with irregular patterns found in MoE. To bridge the functionality gap, we propose DySHARP, an integral dynamic in-switch computing solution to accelerate MoE, encompassing both communication primitives and communication-aware scheduling: 1) Dynamic multimem addressing co-designs ISA, architecture, and runtime, as a dynamic extension to NVLS, reducing redundant traffic. However, the resulting traffic reduction…
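As a rough illustration of the kind of redundancy the abstract refers to, the sketch below counts how many copies of each token leave its home GPU during the dispatch step of expert parallelism. Everything in it is an assumption made for illustration and is not taken from the paper: the routing table, the expert placement, the function name `dispatch_traffic`, and the simplified model in which a switch-side multicast lets the source GPU send each token once while the baseline sends one copy per remote destination GPU.

```python
def dispatch_traffic(routes, token_home, expert_gpu):
    """Count token copies sent by source GPUs during MoE dispatch.

    routes[t]     -> list of expert ids selected for token t (hypothetical router output)
    token_home[t] -> GPU that holds token t before dispatch
    expert_gpu[e] -> GPU that hosts expert e
    Returns (unicast_copies, multicast_copies) under the simplified model
    described in the surrounding text.
    """
    unicast = 0
    multicast = 0
    for t, experts in enumerate(routes):
        # Remote GPUs that need this token; local experts cost no link traffic.
        remote_gpus = {expert_gpu[e] for e in experts} - {token_home[t]}
        # Baseline: the source sends one copy per remote destination GPU.
        unicast += len(remote_gpus)
        # Assumed in-switch multicast: the source sends one copy and the
        # switch replicates it to every remote destination.
        multicast += 1 if remote_gpus else 0
    return unicast, multicast


# Toy setup: 4 GPUs, 8 experts (2 per GPU), top-2 routing, all tokens on GPU 0.
expert_gpu = [0, 0, 1, 1, 2, 2, 3, 3]
token_home = [0, 0, 0, 0]
routes = [[2, 4], [2, 6], [0, 4], [0, 1]]

unicast, multicast = dispatch_traffic(routes, token_home, expert_gpu)
print(f"unicast copies: {unicast}, multicast copies: {multicast}")
```

In this toy case the two tokens routed to experts on two different remote GPUs are sent twice by the baseline but only once under the assumed multicast model, which is where the saving comes from. Because the set of destinations differs per token and per step, a fixed collective pattern would not capture it.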
