Mixture-of-Experts with Expert Choice Routing

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, and James Laudon

arXiv:2202.09368 · cs.LG · October 17, 2022 · 59 cites

TL;DR

This paper introduces an expert routing strategy for Mixture-of-Experts models in which experts select tokens, rather than tokens selecting experts. It reports faster training and better performance on NLP benchmarks than conventional top-k routing, which assigns a fixed number of experts to every token.

Contribution

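The paper proposes a heterogeneous MoE with expert choice routing: each expert selects its top-k tokens, so a token can be routed to a variable number of experts while every expert keeps a fixed bucket size. The authors report that this improves training efficiency and model performance.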

Findings

01. Training convergence time improves by more than 2x.

02. Outperforms prior gating methods on the GLUE and SuperGLUE benchmarks.

03. Achieves better results than the T5 dense model on multiple tasks.

Abstract

Sparsely-activated mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g., one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under- or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function, regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts select the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational…
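
The routing idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names (`expert_choice_route`, `token_choice_load`, `experts_per_token`), the random affinity scores, and the bucket-size rule k = n * c / e (tokens times a capacity factor, divided by the number of experts) are assumptions made for illustration and are not stated in the excerpt above.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]


def expert_choice_route(scores, capacity_factor=2.0):
    """Each expert picks its top-k tokens (illustrative sketch).

    scores: n_tokens x n_experts list of token-to-expert affinity logits.
    Returns (assignments, bucket_size); assignments[e] is a list of
    (token_index, gate_weight) pairs of length bucket_size.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    # Assumed bucket size: k = n * c / e, clamped to [1, n_tokens].
    bucket_size = max(1, min(n_tokens, int(n_tokens * capacity_factor / n_experts)))
    # Normalize each token's scores over the experts.
    probs = [softmax(row) for row in scores]
    assignments = []
    for e in range(n_experts):
        # Rank all tokens by their affinity to expert e, keep the best k.
        ranked = sorted(range(n_tokens), key=lambda t: probs[t][e], reverse=True)
        assignments.append([(t, probs[t][e]) for t in ranked[:bucket_size]])
    return assignments, bucket_size


def token_choice_load(scores, k=1):
    """Contrast case: each token picks its top-k experts.

    Returns the number of tokens each expert receives, which can be uneven.
    """
    n_experts = len(scores[0])
    load = [0] * n_experts
    for row in scores:
        top = sorted(range(n_experts), key=lambda e: row[e], reverse=True)[:k]
        for e in top:
            load[e] += 1
    return load


def experts_per_token(assignments, n_tokens):
    """Count how many experts selected each token (may be 0 for some tokens)."""
    counts = [0] * n_tokens
    for chosen in assignments:
        for t, _ in chosen:
            counts[t] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical random scores stand in for a learned router
    n_tokens, n_experts = 16, 4
    scores = [[rng.gauss(0, 1) for _ in range(n_experts)] for _ in range(n_tokens)]
    assignments, k = expert_choice_route(scores, capacity_factor=2.0)
    print("bucket size:", k)
    print("expert-choice load per expert:", [len(a) for a in assignments])
    print("experts per token:", experts_per_token(assignments, n_tokens))
    print("token-choice top-1 load per expert:", token_choice_load(scores, k=1))
```

In this sketch every expert receives exactly `bucket_size` tokens, so the load is balanced by construction, while the number of experts per token varies and can be zero. The token-choice baseline has the opposite property: a fixed number of experts per token, with no guarantee on per-expert load.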

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Domain Adaptation and Few-Shot Learning

Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Adam · Label Smoothing · SentencePiece · Position-Wise Feed-Forward Layer · Switch FFN