FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
Xinglin Pan, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Zhenheng Tang, Rui Wang, Bo Li, Xiaowen Chu

TL;DR
FSMoE is a flexible training system for sparse mixture-of-experts (MoE) models. It improves training efficiency through a unified abstraction and online profiling of MoE modules, co-scheduling of intra-node and inter-node communication with computation, and adaptive gradient partitioning, and it outperforms DeepSpeed-MoE and Tutel.
Contribution
Introduces FSMoE, a versatile and efficient training system for MoE models with novel scheduling, communication, and adaptive gradient partitioning techniques.
Findings
Supports four MoE routing functions, with up to 1.42× speedup over existing implementations.
Outperforms DeepSpeed-MoE and Tutel by 1.18×-1.22× on 1458 MoE layers.
Achieves 1.19×-3.01× speedup on real-world MoE models.
Abstract
Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules (token routing, token communication, expert computation, and expert parallelism) that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) Unified abstraction and online profiling of MoE modules for task scheduling across various MoE implementations. 2) Co-scheduling intra-node and inter-node communications with computations to minimize communication overheads. 3) An adaptive gradient partitioning method for gradient aggregation, together with a schedule that adaptively pipelines communications and computations, to support near-optimal task scheduling. We…
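To make the token routing module named in the abstract concrete, here is a minimal sketch of top-k softmax gating, one commonly used MoE routing function. This summary does not say which four routing functions FSMoE supports or how it implements them, so the choice of top-k gating and the names `top_k_route` and `expert_loads` are illustrative assumptions, not FSMoE's API.

```python
import math


def top_k_route(logits, k=2):
    """Route each token to its k highest-scoring experts.

    logits: one list of raw gate scores per token (one score per expert).
    Returns, per token, a list of (expert_index, weight) pairs whose
    weights sum to 1 (a softmax restricted to the selected experts).
    Hypothetical helper for illustration; not part of FSMoE.
    """
    routes = []
    for scores in logits:
        # Indices of the k largest scores; ties go to the lower index.
        top = sorted(range(len(scores)), key=lambda e: (-scores[e], e))[:k]
        # Softmax over the selected scores only, shifted for stability.
        peak = max(scores[e] for e in top)
        exps = [math.exp(scores[e] - peak) for e in top]
        total = sum(exps)
        routes.append([(e, x / total) for e, x in zip(top, exps)])
    return routes


def expert_loads(routes, num_experts):
    """Count how many tokens each expert receives under a routing."""
    loads = [0] * num_experts
    for route in routes:
        for expert, _ in route:
            loads[expert] += 1
    return loads


if __name__ == "__main__":
    gate_logits = [
        [2.0, 1.0, 0.0, -1.0],  # token 0 prefers experts 0 and 1
        [0.0, 0.0, 3.0, 3.0],   # token 1 prefers experts 2 and 3 equally
    ]
    routes = top_k_route(gate_logits, k=2)
    print(routes)
    print(expert_loads(routes, num_experts=4))
```

The per-expert load counts matter because they determine how many tokens the token communication step must send to each expert, which is the kind of quantity a scheduler overlapping communication with computation has to account for.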
Taxonomy
Methods: Attention Is All You Need · Cosine Annealing · Adam · Softmax · Residual Connection · Dropout · Byte Pair Encoding · Attention Dropout · Linear Layer
