HybridEP: Scaling Expert Parallelism to Cross-Datacenter Scenario via Hybrid Expert/Data Transmission

Weihao Yang; Hao Huang; Donglei Wu; Ningke Li; Yanqi Pan; Qiyang Zheng; Wen Xia; Shiyi Li; Qiang Wang

arXiv:2510.19470·cs.DC·October 23, 2025

Open Access

TL;DR

HybridEP introduces a dynamic, model-guided framework for expert parallelism in MoE models, significantly improving scalability and training speed across multiple data centers with limited bandwidth.

Contribution

It proposes a novel hybrid expert/data transmission approach with a stream-based model and topology optimization techniques to enhance cross-DC MoE training scalability.

Findings

01. HybridEP outperforms state-of-the-art systems by up to 5.6x under bandwidth constraints.

02. HybridEP achieves up to 1.45x speedup with 1000 data centers in simulations.

03. HybridEP effectively reduces communication overhead in low-bandwidth, cross-DC MoE training.

Abstract

Mixture-of-Experts (MoE) has become a popular architecture for scaling large models. However, rapidly growing model scale is outpacing what can be trained within a single data center (DC), driving a shift toward a more flexible, cross-DC training paradigm. In this setting, Expert Parallelism (EP) for MoE faces significant scalability issues because cross-DC bandwidth is limited. Specifically, existing EP optimizations attempt to overlap data communication with computation, which brings little benefit in low-bandwidth scenarios because data communication takes much longer than computation. As a result, cross-DC EP scaling is fast becoming a critical roadblock to the continued growth of MoE models. To address this, we propose HybridEP, a modeling-guided framework to optimize EP under constrained bandwidth. Our key idea is to dynamically transform the spatial placement of experts to reduce data communication traffic and frequency,…
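
The claim that overlap helps little under low bandwidth can be illustrated with a rough back-of-the-envelope sketch. It assumes an idealized model that is not taken from the paper: without overlap a step costs compute plus communication time, and with perfect overlap it costs the larger of the two. The function name `overlap_speedup` and all timings are hypothetical, chosen only for illustration.

```python
def overlap_speedup(compute_s: float, comm_s: float) -> float:
    """Speedup of perfectly overlapped execution over fully serialized execution.

    Idealized assumption (not the paper's model): overlap hides the shorter
    of the two phases entirely behind the longer one.
    """
    serialized = compute_s + comm_s      # communication and compute run back to back
    overlapped = max(compute_s, comm_s)  # the longer phase bounds the step time
    return serialized / overlapped


# Illustrative timings only: fix compute at 1 s and grow communication time,
# as would happen when moving from intra-DC to slower cross-DC links.
compute_s = 1.0
for comm_s in (1.0, 10.0, 100.0):
    speedup = overlap_speedup(compute_s, comm_s)
    print(f"comm={comm_s:6.1f}s  overlap speedup={speedup:.2f}x")
```

Under this simplified model the best case is a 2x gain when the two phases are balanced, and the gain approaches 1x as communication dominates, which matches the abstract's motivation for reducing traffic rather than only hiding it.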

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques