Tutel: Adaptive Mixture-of-Experts at Scale
Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Joe Chau, Peng Cheng, Fan Yang, Mao Yang, Yongqiang Xiong

TL;DR
Tutel introduces Flex, a scalable system for Mixture-of-Experts (MoE) models that adapts parallelism and pipelining at runtime to match the dynamic expert workload, speeding up both training and inference.
Contribution
Flex uses one identical layout for distributing MoE model parameters and input data, so it can switch between parallelism and pipelining methods at runtime without tensor migration or mathematical inequivalence.
Findings
Achieves up to 5.75x speedup of a single MoE layer over the previous state of the art.
Accelerates SwinV2-MoE training and inference by over 1.5x.
Improves model accuracy in vision tasks.
Abstract
Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters with fixed computational cost. The algorithmic performance of MoE relies on its token routing mechanism that forwards each input token to the right sub-models or experts. While token routing dynamically determines the amount of expert workload at runtime, existing systems suffer from inefficient computation due to their static execution, namely static parallelism and pipelining, which do not adapt to the dynamic workload. We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Flex designs an identical layout for distributing MoE model parameters and input data, which can be leveraged by all possible parallelism or pipelining methods without any mathematical inequivalence or tensor migration…
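The abstract's point that token routing fixes each expert's workload only at runtime can be illustrated with a small sketch. This is not Flex's implementation: it assumes plain top-1 routing, uses random scores in place of a learned gating network, and the helper names route_top1, expert_load, and imbalance are hypothetical.

```python
import random


def route_top1(gate_scores):
    """Return, for each token, the index of its highest-scoring expert."""
    return [max(range(len(scores)), key=scores.__getitem__) for scores in gate_scores]


def expert_load(assignments, num_experts):
    """Count the tokens assigned to each expert."""
    load = [0] * num_experts
    for expert in assignments:
        load[expert] += 1
    return load


def imbalance(load):
    """Busiest expert's load divided by the load under a perfectly even split."""
    even_share = sum(load) / len(load)
    return max(load) / even_share


if __name__ == "__main__":
    rng = random.Random(0)
    num_tokens, num_experts = 16, 4
    for step in range(3):
        # Assumption: random scores stand in for a learned gate's output.
        scores = [[rng.random() for _ in range(num_experts)] for _ in range(num_tokens)]
        load = expert_load(route_top1(scores), num_experts)
        print(f"step {step}: load per expert = {load}, imbalance = {imbalance(load):.2f}")
```

Running the script prints a different per-expert load at each step, which is the dynamic workload that a statically configured parallelism or pipelining plan cannot follow.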
Taxonomy
Topics: Advanced Neural Network Applications · COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Byte Pair Encoding · Stochastic Depth · Adam · Label Smoothing · Swin Transformer
