FastMoE: A Fast Mixture-of-Expert Training System
Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, Jie Tang

TL;DR
FastMoE is a high-performance, open-source distributed MoE training system built on PyTorch, enabling scalable, efficient training of trillion-parameter language models across multiple GPUs and nodes.
Contribution
It introduces a flexible, GPU-compatible MoE training system with optimized acceleration techniques and hierarchical interfaces for model design and adaptation.
Findings
Supports linear scaling of experts with GPUs
Enables training of trillion-parameter models
Optimized for high-performance distributed training
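The first finding can be pictured with a small sketch, assuming experts are spread evenly over workers so that each GPU hosts a fixed number of them. The function names (`place_experts`, `total_expert_params`) and every number used with them are hypothetical illustrations, not part of FastMoE's interface or its reported results.

```python
from typing import Dict


def place_experts(num_workers: int, experts_per_worker: int) -> Dict[int, int]:
    """Map each global expert id to the rank of the worker (GPU) hosting it.

    Assumption: experts are assigned to workers in contiguous blocks.
    """
    total_experts = num_workers * experts_per_worker
    return {
        expert_id: expert_id // experts_per_worker
        for expert_id in range(total_experts)
    }


def total_expert_params(
    num_workers: int, experts_per_worker: int, params_per_expert: int
) -> int:
    """Expert parameter count; grows linearly with the number of workers."""
    return num_workers * experts_per_worker * params_per_expert


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the linear trend.
    for workers in (1, 2, 4, 8):
        placement = place_experts(workers, experts_per_worker=2)
        params = total_expert_params(workers, 2, 1_000_000)
        print(workers, len(placement), params)
```

Under this assumption, doubling the number of workers doubles both the expert count and the expert parameter count, while the memory each worker needs for experts stays constant.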
Abstract
Mixture-of-Expert (MoE) shows strong potential for enlarging language models to trillions of parameters. However, training a trillion-scale MoE model requires algorithm and system co-design to obtain a well-tuned, high-performance distributed training system. Unfortunately, the only existing platform that meets these requirements depends heavily on Google's hardware (TPU) and software (Mesh TensorFlow) stack, and is not open or available to the public, especially the GPU and PyTorch communities. In this paper, we present FastMoE, a distributed MoE training system based on PyTorch with common accelerators. The system provides a hierarchical interface for both flexible model design and easy adaptation to different applications, such as Transformer-XL and Megatron-LM. Unlike a direct implementation of MoE models in PyTorch, FastMoE highly optimizes training speed by…
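To make the MoE idea in the abstract concrete, here is a minimal single-process sketch of top-k gated routing for one token, written with the Python standard library only. It is not FastMoE's implementation or API: `moe_forward`, `make_scaling_expert`, the gate weights, and the toy experts are all hypothetical, and a real system would operate on batched tensors spread across devices.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_scaling_expert(factor: float) -> Expert:
    """Hypothetical toy expert: multiplies its input by a constant."""
    return lambda x: [factor * xi for xi in x]


def moe_forward(
    x: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    top_k: int = 2,
) -> Vector:
    """Route one token to its top_k experts and mix their outputs."""
    # One gate score per expert: dot product of that expert's gate row with x.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Keep only the top_k highest-scoring experts (sparse activation).
    chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:top_k]
    # Normalise the selected scores so the mixing weights sum to 1.
    mix = softmax([scores[e] for e in chosen])
    out = [0.0] * len(x)
    for weight, expert_id in zip(mix, chosen):
        y = experts[expert_id](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


if __name__ == "__main__":
    toy_experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0)]
    toy_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    print(moe_forward([1.0, 2.0], toy_gate, toy_experts, top_k=2))
```

Only `top_k` of the experts run for each token, which is why the parameter count can grow with the number of experts while per-token compute stays roughly fixed.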
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Softmax · Multi-Head Attention · Attention Is All You Need · Adaptive Softmax · Adaptive Input Representations · Residual Connection · Adam
