Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules

Xinglin Pan; Wenxiang Lin; Shaohuai Shi; Xiaowen Chu; Weinong Sun; Bo Li

arXiv:2407.00599·cs.DC·July 4, 2024

TL;DR

Parm accelerates training of large sparsely-activated MoE models on GPU clusters by optimizing how communication tasks are scheduled, outperforming existing systems such as DeepSpeed-MoE.

Contribution

The paper introduces Parm, a system with two dedicated communication schedules that improve training efficiency for large MoE models by eliminating redundant computation and communication and by overlapping intra-node with inter-node communication.
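To illustrate why overlapping communication matters, the following is a minimal sketch of an idealized timing model, not Parm's actual schedules. It assumes that intra-node and inter-node transfers use separate links and can run fully concurrently; the function names and millisecond values are hypothetical.

```python
def sequential_time(intra_ms: float, inter_ms: float) -> float:
    """Time when the two communications are issued one after the other."""
    return intra_ms + inter_ms


def overlapped_time(intra_ms: float, inter_ms: float) -> float:
    """Time under ideal overlap: the slower transfer hides the faster one."""
    return max(intra_ms, inter_ms)


# Hypothetical per-layer costs, chosen only for illustration.
intra_ms, inter_ms = 2.0, 6.0

seq = sequential_time(intra_ms, inter_ms)
ovl = overlapped_time(intra_ms, inter_ms)
saved = seq - ovl

print(f"sequential: {seq} ms, overlapped: {ovl} ms, saved: {saved} ms")
```

In this toy model the saving equals the shorter of the two transfers, so overlap helps most when intra-node and inter-node costs are comparable. Real schedules are also constrained by data dependencies between tasks, which this sketch ignores.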

Findings

01. Achieves up to 5.77× speedup over DeepSpeed-MoE across 1296 configured MoE layers.

02. Approximately 3× faster training on BERT and GPT-2 MoE models.

03. Outperforms DeepSpeed-MoE in large-scale MoE training scenarios.

Abstract

Sparsely-activated Mixture-of-Expert (MoE) layers have found practical applications in enlarging the model size of large-scale foundation models, with only a sub-linear increase in computation demands. Despite the wide adoption of hybrid parallel paradigms like model parallelism, expert parallelism, and expert-sharding parallelism (i.e., MP+EP+ESP) to support MoE model training on GPU clusters, the training efficiency is hindered by communication costs introduced by these parallel paradigms. To address this limitation, we propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks. The proposed schedules eliminate redundant computations and communications and enable overlaps between intra-node and inter-node communications, ultimately reducing the overall training time. As the two schedules are not mutually exclusive,…
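The sub-linear growth in computation mentioned at the start of the abstract can be sketched with a small example. It assumes top-k gating, a common MoE routing choice that the text above does not specify, and uses hypothetical helper names and FLOP counts. The idea is that each token runs only k experts, so expert compute per token scales with k rather than with the total number of experts.

```python
def top_k_route(gate_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda e: gate_scores[e],
                    reverse=True)
    return ranked[:k]


def expert_flops_per_token(num_experts, k, flops_per_expert):
    """Compare running every expert with running only the k selected ones."""
    dense = num_experts * flops_per_expert   # every expert processes the token
    sparse = k * flops_per_expert            # only the routed experts do
    return dense, sparse


# Hypothetical gate scores for one token over 4 experts.
scores = [0.1, 0.7, 0.05, 0.15]
chosen = top_k_route(scores, k=2)

dense, sparse = expert_flops_per_token(num_experts=4, k=2,
                                       flops_per_expert=1_000_000)
print(chosen, dense, sparse)
```

Doubling `num_experts` in this sketch doubles the parameter count of the layer but leaves `sparse` unchanged, which is why the cost of routing tokens between devices becomes the dominant concern.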

Code & Models

Repositories

Fragile-azalea/Parm (Official)

Taxonomy

Topics: Machine Learning and Algorithms · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Discriminative Fine-Tuning · Residual Connection · Multi-Head Attention · WordPiece · Softmax