Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules
Xinglin Pan, Wenxiang Lin, Shaohuai Shi, Xiaowen Chu, Weinong Sun, Bo Li

TL;DR
Parm significantly accelerates large-scale sparsely-activated MoE model training on GPU clusters by optimizing communication schedules, reducing training time and outperforming existing systems like DeepSpeed-MoE.
Contribution
The paper introduces Parm, a system with dedicated communication schedules that improve training efficiency for large MoE models by reducing redundant communication and enabling overlaps.
Findings
Achieves up to 5.77× speedup over DeepSpeed-MoE across 1296 MoE layer configurations.
Approximately 3× faster training on BERT and GPT-2 MoE models.
Outperforms DeepSpeed-MoE in large-scale MoE training scenarios.
Abstract
Sparsely-activated Mixture-of-Expert (MoE) layers have found practical applications in enlarging the model size of large-scale foundation models, with only a sub-linear increase in computation demands. Despite the wide adoption of hybrid parallel paradigms like model parallelism, expert parallelism, and expert-sharding parallelism (i.e., MP+EP+ESP) to support MoE model training on GPU clusters, the training efficiency is hindered by communication costs introduced by these parallel paradigms. To address this limitation, we propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks. The proposed schedules eliminate redundant computations and communications and enable overlaps between intra-node and inter-node communications, ultimately reducing the overall training time. As the two schedules are not mutually exclusive,…
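To make the benefit of overlapping intra-node and inter-node communication concrete, the following is a minimal sketch of an idealized cost model. It is not Parm's scheduling algorithm: the function names and the timings in the usage note are hypothetical, and it assumes the two transfers use separate links and do not contend with each other.

```python
def serial_time(intra_ms: float, inter_ms: float) -> float:
    """Communication time when the two transfers run back to back."""
    return intra_ms + inter_ms


def overlapped_time(intra_ms: float, inter_ms: float) -> float:
    """Idealized time when both transfers run concurrently.

    Assumption: intra-node and inter-node traffic use independent links,
    so the slower transfer alone determines the total time.
    """
    return max(intra_ms, inter_ms)


def overlap_speedup(intra_ms: float, inter_ms: float) -> float:
    """Ratio of serial to overlapped communication time (at most 2x)."""
    if intra_ms < 0 or inter_ms < 0:
        raise ValueError("timings must be non-negative")
    overlapped = overlapped_time(intra_ms, inter_ms)
    if overlapped == 0:
        raise ValueError("at least one transfer must take non-zero time")
    return serial_time(intra_ms, inter_ms) / overlapped


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, chosen only for illustration.
    print(overlap_speedup(intra_ms=2.0, inter_ms=6.0))
```

Under this model the gain is largest when the two transfers take similar time and vanishes when one dominates, which is why overlap alone cannot account for a speedup above 2× on the communication phase; the remaining gains described in the abstract come from eliminating redundant computation and communication.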
Taxonomy
Topics: Machine Learning and Algorithms · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Discriminative Fine-Tuning · Residual Connection · Multi-Head Attention · WordPiece · Softmax
