A Review of Sparse Expert Models in Deep Learning
William Fedus, Jeff Dean, Barret Zoph

TL;DR
Sparse expert models, a thirty-year-old concept, are now central to deep learning: by applying only a subset of the parameters to each example, they enable large yet efficient models across multiple domains.
Contribution
This paper reviews the evolution, algorithms, and recent advances of sparse expert models, highlighting their significance and future research directions.
Findings
Sparse expert models enable large-scale, efficient deep learning.
They have achieved significant improvements in NLP, vision, and speech.
The paper identifies key areas for future research in sparse models.
Abstract
Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. By doing so, the degree of sparsity decouples the parameter count from the compute per example, allowing for extremely large but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.
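To make the abstract's central claim concrete, the following is a minimal sketch of top-k routing in plain Python. It is not taken from the paper. The class name SparseMoE, the helper functions, the linear router, the choice of a single square weight matrix per expert, and the softmax over only the selected logits are all illustrative assumptions. The point it shows is that the number of stored expert parameters grows with the number of experts, while the parameters applied to any one example depend only on k.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseMoE:
    """Hypothetical top-k mixture-of-experts layer (illustrative only)."""

    def __init__(self, num_experts, dim, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.dim = dim
        # Router: one score per expert, computed as a linear map of the input.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Each expert is assumed to be a single dim x dim weight matrix.
        self.experts = [[[rng.gauss(0, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, x):
        """Route one example to its k highest-scoring experts."""
        logits = matvec(self.router, x)
        chosen = sorted(range(len(logits)),
                        key=lambda i: logits[i], reverse=True)[:self.k]
        # Assumption: gate weights are a softmax over the selected logits only.
        gates = softmax([logits[i] for i in chosen])
        out = [0.0] * self.dim
        for gate, idx in zip(gates, chosen):
            # Only the chosen experts' weights are touched for this example.
            y = matvec(self.experts[idx], x)
            out = [o + gate * yi for o, yi in zip(out, y)]
        return out, chosen

    def total_expert_params(self):
        """Expert parameters stored in the layer."""
        return len(self.experts) * self.dim * self.dim

    def active_expert_params(self):
        """Expert parameters actually applied to a single example."""
        return self.k * self.dim * self.dim
```

With num_experts=8, dim=4, and k=2, this layer stores 8 × 4 × 4 = 128 expert parameters but applies only 2 × 4 × 4 = 32 of them to each example. Doubling num_experts doubles the first number and leaves the second unchanged, which is the decoupling the abstract describes.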
Taxonomy
Topics: COVID-19 diagnosis using AI · Expert finding and Q&A systems · Domain Adaptation and Few-Shot Learning
Methods: Balanced Selection
