Mixture of A Million Experts

Xu Owen He

arXiv:2407.04153 · cs.LG · July 8, 2024 · 2 cites

TL;DR

This paper introduces PEER, a new layer design that efficiently retrieves from over a million tiny experts, enabling massive model scaling with improved performance and computational efficiency in transformer architectures.

Contribution

PEER leverages product key techniques for sparse expert retrieval, allowing scalable use of over a million experts in transformer models.

Findings

01. PEER outperforms dense FFWs in language modeling tasks.

02. PEER achieves better performance-compute trade-offs than coarse-grained MoEs.

03. PEER enables scaling transformer models with many tiny experts (over a million).

Abstract

The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of…
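
The product key technique named in the abstract can be illustrated with a small sketch. It is not the paper's implementation. It assumes the usual product-key setup: the query is split into two halves, each half is scored against its own table of n sub-keys, and expert (i, j) receives the sum of the two half-scores, so n² experts are addressed by only 2n sub-keys. The function names, dimensions, and the brute-force check are hypothetical and exist only for illustration.

```python
import heapq
import random


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    """Return the k largest (score, index) pairs, best first."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))


def product_key_top_k(query, sub_keys_1, sub_keys_2, k):
    """Top-k experts over the n*n product keys, scoring only 2n sub-keys.

    Assumption: expert (i, j) has score q1.c1[i] + q2.c2[j] and flat
    index i * n + j, where q1 and q2 are the two halves of the query.
    """
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    n = len(sub_keys_2)
    # Step 1: top-k within each sub-key table, independently.
    best_1 = top_k([dot(q1, c) for c in sub_keys_1], k)
    best_2 = top_k([dot(q2, c) for c in sub_keys_2], k)
    # Step 2: the overall top-k must lie among the k*k combinations.
    candidates = [(s1 + s2, i * n + j) for s1, i in best_1 for s2, j in best_2]
    return heapq.nlargest(k, candidates)


def brute_force_top_k(query, sub_keys_1, sub_keys_2, k):
    """Reference: score every one of the n*n experts explicitly."""
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    n = len(sub_keys_2)
    scores = [
        (dot(q1, c1) + dot(q2, c2), i * n + j)
        for i, c1 in enumerate(sub_keys_1)
        for j, c2 in enumerate(sub_keys_2)
    ]
    return heapq.nlargest(k, scores)


# Toy sizes (hypothetical): n = 32 sub-keys per table addresses 1,024 experts.
# With n = 1024 the same code would address 1,048,576 experts.
random.seed(0)
d, n, k = 8, 32, 4
keys_1 = [[random.gauss(0, 1) for _ in range(d // 2)] for _ in range(n)]
keys_2 = [[random.gauss(0, 1) for _ in range(d // 2)] for _ in range(n)]
query = [random.gauss(0, 1) for _ in range(d)]
selected = product_key_top_k(query, keys_1, keys_2, k)
print([index for _, index in selected])
```

Under these assumptions the factored search compares the query against 2n half-width sub-keys and then ranks k² candidate sums, instead of scoring all n² experts. Barring exact ties, it returns the same experts as the brute-force search.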

Code & Models

Repositories

huyphan168/PEER (PyTorch)

Models

ThomasTheMaker/PEER-v1 (Hugging Face model, ♡ 1)

Taxonomy

Topics: Census and Population Estimation · COVID-19 epidemiological studies · Survey Sampling and Estimation Techniques

Methods: Mixture of Experts