Accelerating Distributed MoE Training and Inference with Lina

Jiamin Li; Yimin Jiang; Yibo Zhu; Cong Wang; Hong Xu

arXiv:2210.17223·cs.DC·April 30, 2024·1 cites

TL;DR

This paper introduces Lina, a system that accelerates distributed MoE training and inference by addressing the all-to-all communication bottleneck. It reduces training step time by up to 1.73x and 95th percentile inference time by 1.63x on average.

Contribution

The paper systematically analyzes why all-to-all communication becomes the bottleneck in distributed MoE training and inference, and builds Lina, which prioritizes all-to-all over the concurrent allreduce whenever feasible.

Findings

01. Lina reduces training step time by up to 1.73x.

02. Lina decreases 95th percentile inference time by an average of 1.63x.

03. The system effectively balances transfer size and bandwidth during inference.

Abstract

Scaling model parameters improves model quality at the price of high computation overhead. Sparsely activated models, usually in the form of Mixture of Experts (MoE) architecture, have sub-linear scaling of computation cost with model size, thus providing opportunities to train and serve a larger model at lower cost than their dense counterparts. However, distributed MoE training and inference is inefficient, mainly due to the interleaved all-to-all communication during model computation. This paper makes two main contributions. First, we systematically analyze all-to-all overhead in distributed MoE and present the main causes for it to be the bottleneck in training and inference, respectively. Second, we design and build Lina to address the all-to-all bottleneck head-on. Lina opportunistically prioritizes all-to-all over the concurrent allreduce whenever feasible using tensor…
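
To make the all-to-all pattern mentioned in the abstract concrete, the following is a minimal sketch of how token routing in an expert-parallel MoE layer produces a device-to-device transfer matrix. It is not Lina's implementation. It assumes one expert per device and uses uniform-random top-1 routing as a stand-in for a learned gate; the names `route_tokens` and `expert_load` are hypothetical and chosen only for illustration.

```python
import random


def route_tokens(num_devices, tokens_per_device, seed=0):
    """Build send[src][dst]: tokens that device src sends to the expert on device dst.

    Assumptions (illustrative only): one expert per device, top-1 routing,
    and a uniform-random choice standing in for a learned gating network.
    """
    rng = random.Random(seed)
    send = [[0] * num_devices for _ in range(num_devices)]
    for src in range(num_devices):
        for _ in range(tokens_per_device):
            dst = rng.randrange(num_devices)  # stand-in for the gate's decision
            send[src][dst] += 1
    return send


def expert_load(send):
    """Tokens each expert receives, i.e. the column sums of the send matrix."""
    n = len(send)
    return [sum(send[src][dst] for src in range(n)) for dst in range(n)]


# Every device may need to send tokens to every other device: an all-to-all.
send = route_tokens(num_devices=4, tokens_per_device=8)
load = expert_load(send)

for src, row in enumerate(send):
    print(f"device {src} sends {row}")
print("tokens received per expert:", load)
```

Each row of `send` is one device's outgoing traffic, so the whole matrix must be exchanged before any expert can start computing, and a second exchange of the same shape returns the expert outputs to their source devices. The per-expert totals in `load` are generally uneven, which gives a rough sense of why transfer sizes can differ across devices.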

Taxonomy

Topics: Tensor decomposition and applications · Traffic Prediction and Management Techniques · Context-Aware Activity Recognition Systems