FastMoE: A Fast Mixture-of-Expert Training System

Jiaao He; Jiezhong Qiu; Aohan Zeng; Zhilin Yang; Jidong Zhai; Jie Tang

arXiv:2103.13262 · cs.LG · March 25, 2021 · 39 cites

TL;DR

FastMoE is a high-performance, open-source distributed MoE training system built on PyTorch, enabling scalable, efficient training of trillion-parameter language models across multiple GPUs and nodes.

Contribution

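It introduces a flexible MoE training system for GPUs and PyTorch, with a hierarchical interface for model design and for adaptation to existing codebases such as Transformer-XL and Megatron-LM, and with training speed optimized beyond a direct PyTorch implementation.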

Findings

01

Supports linear scaling of experts with GPUs

02

Enables training of trillion-parameter models

03

Optimized for high-performance distributed training

Abstract

Mixture-of-Expert (MoE) shows strong potential for scaling language models to trillions of parameters. However, training a trillion-scale MoE model requires algorithm and system co-design to obtain a well-tuned, high-performance distributed training system. Unfortunately, the only existing platform that meets these requirements depends strongly on Google's hardware (TPU) and software (Mesh Tensorflow) stack, and is not open and available to the public, especially the GPU and PyTorch communities. In this paper, we present FastMoE, a distributed MoE training system based on PyTorch with common accelerators. The system provides a hierarchical interface for both flexible model design and easy adaptation to different applications, such as Transformer-XL and Megatron-LM. Unlike a direct implementation of MoE models in PyTorch, training speed in FastMoE is highly optimized by…
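
To make the MoE idea in the abstract concrete, here is a minimal sketch of a top-k gated MoE layer in plain Python. It is an illustration only and is not FastMoE's code or API. Every name in it (`top_k_gate`, `moe_forward`, `owner_of`) is hypothetical, each expert is reduced to a single linear map for brevity, and the even split of experts across workers is an assumed placement rule chosen to show how the expert count can grow with the number of workers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def top_k_gate(gate_weights, token, k):
    """Score every expert, keep the k best, and renormalise their weights."""
    scores = matvec(gate_weights, token)
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(gate_weights, experts, token, k=2):
    """Return the gate-weighted sum of the k selected experts' outputs."""
    output = [0.0] * len(token)
    for expert_id, weight in top_k_gate(gate_weights, token, k):
        expert_out = matvec(experts[expert_id], token)  # only k experts run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


def owner_of(expert_id, experts_per_worker):
    """Assumed placement rule: experts are split evenly across workers."""
    return expert_id // experts_per_worker


# Hypothetical sizes: 4 workers with 2 experts each gives 8 experts in total.
random.seed(0)
d_model, experts_per_worker, world_size = 4, 2, 4
num_experts = experts_per_worker * world_size

gate = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(num_experts)]
experts = [
    [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_model)]
    for _ in range(num_experts)
]

token = [0.5, -1.0, 0.25, 2.0]
routing = top_k_gate(gate, token, k=2)
output = moe_forward(gate, experts, token, k=2)
workers = [owner_of(expert_id, experts_per_worker) for expert_id, _ in routing]
```

Only `k` experts run per token, so the total parameter count grows with `num_experts` while the per-token compute stays roughly fixed. In a distributed setting, a token whose selected expert lives on another worker has to be sent there and its result sent back; making that exchange efficient is the kind of problem a system such as FastMoE addresses.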

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Softmax · Multi-Head Attention · Attention Is All You Need · Adaptive Softmax · Adaptive Input Representations · Residual Connection · Adam