Mixture-of-Experts with Expert Choice Routing
Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, and James Laudon

TL;DR
This paper introduces an expert routing strategy for Mixture-of-Experts models in which experts select tokens rather than tokens selecting experts, leading to faster training and better performance on NLP benchmarks than prior routing methods that assign a fixed number of experts to every token.
Contribution
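The paper proposes a heterogeneous MoE with expert choice routing: each expert selects its top-k tokens, so a token can be routed to a variable number of experts while every expert keeps a fixed bucket size. This improves training efficiency and model performance.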
Findings
Training convergence time improved by over 2x.
Outperforms prior gating methods on GLUE and SuperGLUE benchmarks.
Achieves better results than the T5 dense model on multiple tasks.
Abstract
Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational…
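The routing idea in the abstract (experts pick their top-k tokens, each with a fixed bucket size) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `expert_choice_route`, the gating matrix `w_gate`, the softmax over the expert axis, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def expert_choice_route(tokens, w_gate, bucket_size):
    """Hypothetical helper: each expert selects its top `bucket_size` tokens.

    tokens: (n_tokens, d_model), w_gate: (d_model, n_experts).
    Returns indices and gates, both of shape (n_experts, bucket_size).
    """
    # Token-to-expert affinity scores (assumed: softmax over experts).
    scores = softmax(tokens @ w_gate, axis=-1)      # (n_tokens, n_experts)
    per_expert = scores.T                           # (n_experts, n_tokens)
    # Top-k over the token axis, so the experts do the choosing.
    indices = np.argsort(-per_expert, axis=1)[:, :bucket_size]
    gates = np.take_along_axis(per_expert, indices, axis=1)
    return indices, gates

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, bucket_size = 8, 4, 2, 3   # toy sizes
tokens = rng.normal(size=(n_tokens, d_model))
w_gate = rng.normal(size=(d_model, n_experts))

indices, gates = expert_choice_route(tokens, w_gate, bucket_size)

# How many experts picked each token: this can be 0, 1, or more,
# while every expert processes exactly `bucket_size` tokens.
experts_per_token = np.bincount(indices.ravel(), minlength=n_tokens)
print(indices)
print(experts_per_token)
```

In this sketch the load is balanced by construction, since every row of `indices` has the same length, whereas the number of experts per token varies with the scores.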
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Domain Adaptation and Few-Shot Learning
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Adam · Label Smoothing · SentencePiece · Position-Wise Feed-Forward Layer · Switch FFN
