Mixture of A Million Experts
Xu Owen He

TL;DR
This paper introduces PEER, a new layer design that efficiently retrieves from over a million tiny experts, enabling massive model scaling with improved performance and computational efficiency in transformer architectures.
Contribution
PEER leverages product key techniques for sparse expert retrieval, allowing scalable use of over a million experts in transformer models.
Findings
PEER outperforms dense FFWs in language modeling tasks.
PEER achieves better performance-compute trade-offs than coarse-grained MoEs.
PEER enables transformer models to scale to over a million tiny experts.
Abstract
The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of…
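The product key technique mentioned in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names, the plain dot-product scoring, and the toy sizes are assumptions chosen for illustration. The idea is that each expert key is the concatenation of one sub-key from table A and one from table B, so two tables of 1024 sub-keys index 1024 × 1024 = 1,048,576 experts while only 2 × 1024 sub-key scores are ever computed per query.

```python
import heapq
import random


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def product_key_topk(query, sub_keys_a, sub_keys_b, k):
    """Return the k best (score, expert_index) pairs, highest score first.

    Hypothetical helper: expert (i, j) is given the flat index
    i * len(sub_keys_b) + j, and its key is sub_keys_a[i] + sub_keys_b[j]
    (list concatenation).
    """
    half = len(query) // 2
    q_a, q_b = query[:half], query[half:]

    # Score each half of the query against its own small sub-key table.
    scores_a = [dot(q_a, key) for key in sub_keys_a]
    scores_b = [dot(q_b, key) for key in sub_keys_b]

    # Keep only the k best sub-keys from each table.
    top_a = heapq.nlargest(k, range(len(scores_a)), key=scores_a.__getitem__)
    top_b = heapq.nlargest(k, range(len(scores_b)), key=scores_b.__getitem__)

    # The score of product key (i, j) is the sum of its two half scores,
    # so the global top-k must lie among these k * k candidates.
    candidates = [
        (scores_a[i] + scores_b[j], i * len(sub_keys_b) + j)
        for i in top_a
        for j in top_b
    ]
    return heapq.nlargest(k, candidates)


def brute_force_topk(query, sub_keys_a, sub_keys_b, k):
    """Reference search that scores every one of the N = |A| * |B| keys."""
    half = len(query) // 2
    all_scores = [
        (dot(query[:half], a) + dot(query[half:], b), i * len(sub_keys_b) + j)
        for i, a in enumerate(sub_keys_a)
        for j, b in enumerate(sub_keys_b)
    ]
    return heapq.nlargest(k, all_scores)


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_sub, k = 8, 32, 4  # toy sizes: 32 * 32 = 1024 experts
    table_a = [[rng.gauss(0, 1) for _ in range(dim // 2)] for _ in range(n_sub)]
    table_b = [[rng.gauss(0, 1) for _ in range(dim // 2)] for _ in range(n_sub)]
    query = [rng.gauss(0, 1) for _ in range(dim)]
    for score, index in product_key_topk(query, table_a, table_b, k):
        print(index, round(score, 3))
```

In this sketch the retrieval cost grows with the sub-key table size (the square root of the number of experts) plus k × k candidates, rather than with the full expert count. How the retrieved experts are then weighted and combined is left out here, since the summary above does not describe it.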
Taxonomy
Topics: Census and Population Estimation · COVID-19 epidemiological studies · Survey Sampling and Estimation Techniques
Methods: Mixture of Experts
