Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation
Fahao Chen, Peng Li, Zicong Hong, Zhou Su, Song Guo

TL;DR
Luffy is a distributed MoE training system that reduces inter-GPU communication by migrating sequences and condensing tokens, achieving up to 2.73x speedup over state-of-the-art systems while maintaining a high degree of expert parallelism.
Contribution
Luffy introduces sequence migration and token condensation techniques to improve communication efficiency in distributed MoE training.
Findings
Achieves up to 2.73x speedup over state-of-the-art systems.
Effectively reduces inter-GPU traffic without sacrificing parallelism.
Demonstrates scalability on 16 V100 GPUs.
Abstract
Mixture-of-Experts (MoE) is an emerging technique for scaling large models with sparse activation. MoE models are typically trained in a distributed manner with an expert parallelism scheme, where the experts in each MoE layer are distributed across multiple GPUs. However, default expert parallelism places a heavy burden on the network because of the all-to-all exchange of intermediate data among GPUs before and after the experts run. Some existing works reduce these exchanges by transferring experts between GPUs, but doing so lowers the parallelism of expert execution and makes computation inefficient. These weaknesses motivate us to explore whether it is possible to reduce inter-GPU traffic while maintaining a high degree of expert parallelism. This paper answers in the affirmative by presenting Luffy, a…
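To make the all-to-all burden described in the abstract concrete, the following is a minimal sketch that counts how many token transfers cross GPU boundaries in one MoE layer under plain expert parallelism. It is an illustration only, not Luffy's method: the function name `all_to_all_traffic`, the top-1 routing assumption, and the toy routing table are all hypothetical and do not come from the paper.

```python
from typing import List


def all_to_all_traffic(routing: List[List[int]], expert_to_gpu: List[int]) -> int:
    """Count cross-GPU token transfers for one MoE layer.

    routing[g][t] is the expert chosen for token t held on GPU g
    (top-1 routing is an assumption made for this sketch).
    expert_to_gpu[e] is the GPU that hosts expert e.
    """
    dispatched = 0
    for gpu, experts in enumerate(routing):
        for expert in experts:
            # A token leaves its GPU only if its expert lives elsewhere.
            if expert_to_gpu[expert] != gpu:
                dispatched += 1
    # Every dispatched token is sent back after the expert runs,
    # so the dispatch and combine exchanges move the same number of tokens.
    return 2 * dispatched


# Toy setup (hypothetical): 2 GPUs, 4 experts, 4 tokens per GPU.
routing = [[0, 1, 1, 3], [2, 2, 0, 1]]
expert_to_gpu = [0, 0, 1, 1]
print(all_to_all_traffic(routing, expert_to_gpu))
```

In this toy setup three of the eight tokens are routed to an expert on the other GPU, so the layer incurs six cross-GPU transfers. Under these assumptions, moving experts changes `expert_to_gpu`, whereas an approach that relocates the data instead would change which GPU holds each entry of `routing`.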
Taxonomy
Topics: Speech Recognition and Synthesis · Domain Adaptation and Few-Shot Learning · Neural Networks and Applications
