DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Samyam Rajbhandari; Conglong Li; Zhewei Yao; Minjia Zhang; Reza Yazdani Aminabadi; Ammar Ahmad Awan; Jeff Rasley; Yuxiong He

arXiv:2201.05596 · cs.LG · July 25, 2022 · 55 citations

TL;DR

DeepSpeed-MoE is an end-to-end solution for training and serving large Mixture-of-Experts (MoE) models. It reduces model size, inference latency, and inference cost enough to make deploying massive sparse models practical.

Contribution

The paper presents novel MoE architecture designs, model compression techniques, and an optimized inference system that together improve the efficiency and scalability of MoE models.

Findings

01. MoE model size reduced by up to 3.7x

02. Inference latency and cost improved by 7.3x compared to existing MoE inference solutions

03. Inference cost reduced by up to 9x compared to quality-equivalent dense models

Abstract

As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Their training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting its practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x,…

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms · Speech Recognition and Synthesis