Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models
Wei Wang, Zhiquan Lai, Shengwei Li, Weijie Liu, Keshi Ge, Ao Shen, Huayou Su, Dongsheng Li

TL;DR
Pro-Prophet is a systematic load-balancing system that improves the efficiency of parallel training for large-scale Mixture-of-Experts (MoE) models by balancing device workloads, which in turn reduces training time.
Contribution
It introduces a novel planner and scheduler that adaptively optimize load balancing and communication efficiency during MoE model training.
Findings
Achieves up to 2.66x speedup over existing systems.
Enhances load balancing by up to 11.01x.
Reduces communication volume and improves throughput.
Abstract
The size of deep learning models has been increasing to enhance model quality. The linear increase in training computation budget with model size means that training an extremely large-scale model is exceedingly time-consuming. Recently, the Mixture of Expert (MoE) has drawn significant attention as it can scale models to extra-large sizes with a stable computation budget. However, inefficient distributed training of large-scale MoE models hinders their broader application. Specifically, a considerable dynamic load imbalance occurs among devices during training, significantly reducing throughput. Several load-balancing works have been proposed to address the challenge. System-level solutions draw more attention for their hardware affinity and non-disruption of model convergence compared to algorithm-level ones. However, they are troubled by high communication costs and poor…
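To make the load-imbalance problem described in the abstract concrete, here is a minimal sketch in Python. It is not Pro-Prophet's planner or scheduler. It assumes one expert per device and top-1 routing, and it stands in for a learned gate with a skewed random choice. The names `route_tokens` and `imbalance_ratio`, and the ratio itself (maximum load divided by mean load), are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def route_tokens(num_tokens, expert_weights, seed=0):
    """Assign each token to one expert and return the token count per expert.

    Assumption: a weighted random choice stands in for a learned top-1 gate.
    """
    rng = random.Random(seed)
    experts = list(range(len(expert_weights)))
    choices = rng.choices(experts, weights=expert_weights, k=num_tokens)
    counts = Counter(choices)
    return [counts.get(e, 0) for e in experts]


def imbalance_ratio(tokens_per_device):
    """Maximum load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(tokens_per_device) / len(tokens_per_device)
    return max(tokens_per_device) / mean


# Hypothetical skew: the first expert is far more popular than the rest.
loads = route_tokens(num_tokens=4096, expert_weights=[8, 4, 2, 1, 1, 1, 1, 1])
ratio = imbalance_ratio(loads)
print("tokens per device:", loads)
print("imbalance ratio:", round(ratio, 2))
```

When devices synchronize at each MoE layer, the most heavily loaded device tends to set the pace, so a ratio well above 1.0 suggests that the other devices spend part of each step idle. Because routing decisions change from batch to batch, the skew shifts during training, which is why the abstract calls the imbalance dynamic.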
Taxonomy
Topics: Distributed and Parallel Computing Systems · Software System Performance and Reliability · Cloud Computing and Resource Management
