TL;DR
EMO introduces a pretraining approach for Mixture-of-Experts models that promotes emergent, domain-specific modularity, enabling efficient expert subset selection with minimal performance loss.
Contribution
The paper presents EMO, a novel MoE pretraining method that encourages domain-based expert grouping without human priors, improving modularity and memory efficiency.
Findings
EMO matches standard MoE performance when fully activated.
Retaining only 25% of experts causes only a 1% drop in performance.
Expert subsets in EMO specialize along semantic, domain-level distinctions.
Abstract
Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts (MoEs) seemingly offer a potential alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity (the independent use and composition of expert subsets) without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while…
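The idea of restricting a document's tokens to a shared expert pool can be illustrated with a small sketch. The abstract is truncated here and does not say how the pool is chosen, so everything below is an assumption made for illustration: the pool is taken to be the experts with the highest mean router probability over the document, and the names `route_document`, `pool_size`, and `top_k` are hypothetical, not taken from the paper or its code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_indices(scores, n, allowed=None):
    """Indices of the n highest scores, optionally limited to `allowed`."""
    candidates = range(len(scores)) if allowed is None else allowed
    # Ties are broken by the lower expert index so the result is deterministic.
    return sorted(candidates, key=lambda i: (-scores[i], i))[:n]


def route_document(token_logits, pool_size, top_k):
    """Hypothetical document-level routing (an assumption, not EMO's exact rule).

    token_logits: one list of router logits per token, all of equal length.
    Returns (pool, choices): the shared expert pool for the document and the
    top_k experts each token selects from inside that pool.
    """
    probs = [softmax(row) for row in token_logits]
    num_experts = len(probs[0])
    # Assumed pool rule: rank experts by mean router probability over the document.
    mean_probs = [
        sum(p[e] for p in probs) / len(probs) for e in range(num_experts)
    ]
    pool = sorted(top_indices(mean_probs, pool_size))
    # Each token then picks its top_k experts, but only among pool members.
    choices = [top_indices(p, top_k, allowed=pool) for p in probs]
    return pool, choices


if __name__ == "__main__":
    logits = [
        [3.0, 0.0, 2.0, 0.0],
        [2.0, 0.0, 3.0, 0.0],
        [0.0, 0.0, 1.0, 2.0],  # on its own, this token would prefer expert 3
    ]
    pool, choices = route_document(logits, pool_size=2, top_k=1)
    print("shared pool:", pool)
    print("per-token choices:", choices)
```

In this toy input the third token's highest-scoring expert falls outside the document's pool, so it is routed to a pool member instead. That is the kind of constraint that would push tokens from the same document toward the same experts.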
Code & Models
- 🤗 allenai/Dense_1b_130B · model · 384 dl · ♡ 3
- 🤗 allenai/Emo_1b14b_1T · model · 747 dl · ♡ 21
- 🤗 allenai/StdMoE_1b4b_130B · model · 830 dl · ♡ 3
- 🤗 allenai/StdMoE_1b14b_1T · model · 129 dl · ♡ 2
- 🤗 allenai/EMO · model · 103 dl · ♡ 5
- 🤗 georgesZam/emo-1b14b-1t-4bit · model · 39 dl · ♡ 1
- 🤗 allenai/Emo_1b14b_130B · model · 190 dl · ♡ 3
- 🤗 allenai/StdMoE_1b14b_130B · model · 103 dl · ♡ 1
- 🤗 allenai/StdMoE_1b14b_1T_Preanneal · model · 73 dl · ♡ 1
- 🤗 allenai/StdMoE_1b14b_1T_EmoAnnealed · model · 78 dl · ♡ 1
