Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference

Sneha Kudugunta; Yanping Huang; Ankur Bapna; Maxim Krikun; Dmitry Lepikhin; Minh-Thang Luong; Orhan Firat

arXiv:2110.03742 · cs.CL · October 11, 2021

TL;DR

This paper introduces task-level routing in Mixture-of-Experts models, enabling efficient inference with smaller sub-networks that retain high performance, outperforming token-level routing and distillation methods in multilingual translation tasks.

Contribution

It proposes a novel task-level routing strategy for MoE models that produces smaller, high-performing sub-networks suitable for deployment without distillation.
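
To illustrate what extracting a sub-network could look like, here is a minimal sketch, not the authors' implementation. It assumes a single MoE feed-forward layer, one expert per task, and a ReLU activation. The names `task_to_expert`, `extract_subnetwork`, and the layer sizes are hypothetical and chosen only for illustration. The point it shows: once routing depends only on the task, the weights a task needs can be copied out into an ordinary dense layer that produces the same outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model, d_ff = 4, 8, 16  # illustrative sizes, not from the paper

# Full MoE feed-forward layer: one (w_in, w_out) pair per expert.
w_in = rng.normal(size=(num_experts, d_model, d_ff))
w_out = rng.normal(size=(num_experts, d_ff, d_model))

# Hypothetical fixed task -> expert assignment produced by a task-level router.
task_to_expert = {"en-fr": 2, "en-de": 0}

def moe_ffn(x, expert):
    """Run tokens x of shape (seq_len, d_model) through one expert of the full layer."""
    hidden = np.maximum(x @ w_in[expert], 0.0)  # ReLU (assumed activation)
    return hidden @ w_out[expert]

def extract_subnetwork(task):
    """Copy out only the weights this task uses; the result is a dense FFN."""
    e = task_to_expert[task]
    return w_in[e].copy(), w_out[e].copy()

def dense_ffn(x, weights):
    """Plain dense feed-forward layer using the extracted weights."""
    sub_w_in, sub_w_out = weights
    return np.maximum(x @ sub_w_in, 0.0) @ sub_w_out

x = rng.normal(size=(5, d_model))
sub = extract_subnetwork("en-fr")
full_out = moe_ffn(x, task_to_expert["en-fr"])
sub_out = dense_ffn(x, sub)

full_params = w_in.size + w_out.size
sub_params = sub[0].size + sub[1].size
print(np.allclose(full_out, sub_out), full_params, sub_params)
```

In this toy setting the extracted layer holds 1/`num_experts` of the MoE layer's parameters. With token-level routing the same extraction is not possible, because any token of any task may be sent to any expert.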

Findings

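1. Task-MoE with 32 experts outperforms the best token-MoE by +1.0 BLEU on average across 30 WMT language pairs.

2. Peak inference throughput improves by up to 2.6x with task-level routing instead of token-level routing.

3. Task-MoE preserves all BLEU gains without additional inference costs.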

Abstract

Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large and practitioners often resort to methods such as distillation for serving. In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation. Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models. On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs. The peak inference throughput is also improved by a factor of 1.9x when we route by tasks instead of tokens.…
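
The three routing granularities named in the abstract can be contrasted with a small sketch. This is a simplified illustration under stated assumptions, not the paper's gating function: it uses top-1 selection, a single linear gate `w_gate`, a mean over token representations for the sentence-level case, and a learned per-task embedding `task_embed` for the task-level case. All of these names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model, num_tasks = 4, 8, 3  # illustrative sizes

# Hypothetical gating parameters.
w_gate = rng.normal(size=(d_model, num_experts))    # shared linear gate
task_embed = rng.normal(size=(num_tasks, d_model))  # one embedding per task

def route_token(tokens):
    """Token level: each token picks its own expert."""
    return np.argmax(tokens @ w_gate, axis=-1)

def route_sentence(tokens):
    """Sentence level: all tokens share the expert chosen from the mean token representation."""
    expert = np.argmax(tokens.mean(axis=0) @ w_gate)
    return np.full(len(tokens), expert)

def route_task(tokens, task_id):
    """Task level: all tokens of a task share the expert chosen from the task embedding."""
    expert = np.argmax(task_embed[task_id] @ w_gate)
    return np.full(len(tokens), expert)

sentence = rng.normal(size=(6, d_model))  # 6 tokens
token_choice = route_token(sentence)
sentence_choice = route_sentence(sentence)
task_choice = route_task(sentence, task_id=1)

print("token   :", token_choice)
print("sentence:", sentence_choice)
print("task    :", task_choice)
```

Under token-level routing the experts used can differ from token to token, so every expert must be available at serving time. Under task-level routing the choice depends only on `task_id`, so it is known before any input arrives and only the selected experts need to be loaded for that task.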

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques