FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

Xinglin Pan; Wenxiang Lin; Lin Zhang; Shaohuai Shi; Zhenheng Tang; Rui Wang; Bo Li; Xiaowen Chu

arXiv:2501.10714·cs.LG·January 22, 2025

TL;DR

FSMoE is a flexible training system for sparse mixture-of-experts (MoE) models. It improves training efficiency through a unified abstraction and online profiling of MoE modules, co-scheduling of intra-node and inter-node communication with computation, and adaptive gradient partitioning, and it outperforms DeepSpeed-MoE and Tutel.

Contribution

Introduces FSMoE, a versatile and efficient training system for MoE models with novel scheduling, communication, and adaptive gradient partitioning techniques.

Findings

01

Supports four MoE routing functions with up to 1.42× speedup.

02

Outperforms DeepSpeed-MoE and Tutel by 1.18×-1.22× on 1458 MoE layers.

03

Achieves 1.19×-3.01× speedup on real-world MoE models.

Abstract

Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, token communication, expert computation, and expert parallelism, that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) Unified abstraction and online profiling of MoE modules for task scheduling across various MoE implementations. 2) Co-scheduling intra-node and inter-node communications with computations to minimize communication overheads. 3) To support near-optimal task scheduling, we design an adaptive gradient partitioning method for gradient aggregation and a schedule to adaptively pipeline communications and computations. We…
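To make the idea of pipelining communications and computations concrete, the following is a minimal sketch of a toy cost model, not FSMoE's scheduler. It assumes each stage (for example token dispatch and expert computation) runs on its own resource, the batch is split into equal chunks, and every chunk pays a fixed startup cost per stage. The function names, the stage times, and the `startup` parameter are all hypothetical and chosen only for illustration.

```python
def pipelined_makespan(stage_times, num_chunks, startup=0.0):
    """Finish time of a linear pipeline when the work is split into equal chunks.

    stage_times: total time of each stage for the whole batch
                 (hypothetical values, e.g. [dispatch, expert_compute]).
    num_chunks:  number of equal chunks the batch is split into.
    startup:     assumed fixed overhead paid per chunk on every stage.
    """
    per_chunk = [t / num_chunks + startup for t in stage_times]
    finish = [0.0] * len(per_chunk)  # finish time of the latest chunk on each stage
    for _ in range(num_chunks):
        prev = 0.0  # when this chunk left the previous stage
        for s, t in enumerate(per_chunk):
            # A chunk starts once it has left the previous stage
            # and the current stage has finished the previous chunk.
            start = max(prev, finish[s])
            finish[s] = start + t
            prev = finish[s]
    return finish[-1]


def best_chunk_count(stage_times, max_chunks=16, startup=0.0):
    """Chunk count in 1..max_chunks with the smallest makespan in this toy model."""
    return min(
        range(1, max_chunks + 1),
        key=lambda r: pipelined_makespan(stage_times, r, startup),
    )


if __name__ == "__main__":
    stages = [4.0, 4.0]  # hypothetical communication and computation times
    for r in (1, 2, 4):
        print(r, pipelined_makespan(stages, r))
    print("best r with startup=0.5:", best_chunk_count(stages, 8, startup=0.5))
```

In this model, more chunks increase overlap between stages but also accumulate per-chunk overhead, so the best chunk count depends on the measured stage times. That trade-off is one reason a system might profile its modules online before choosing a schedule.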

Taxonomy

Methods: Attention Is All You Need · Cosine Annealing · Adam · Softmax · Residual Connection · Dropout · Byte Pair Encoding · Attention Dropout · Linear Layer