EMO: Pretraining Mixture of Experts for Emergent Modularity

Ryan Wang; Akshita Bhagia; Sewon Min

arXiv:2605.06663·cs.CL·May 12, 2026

TL;DR

EMO introduces a pretraining approach for Mixture-of-Experts models that promotes emergent, domain-specific modularity, enabling efficient expert subset selection with minimal performance loss.

Contribution

The paper presents EMO, a novel MoE pretraining method that encourages domain-based expert grouping without human priors, improving modularity and memory efficiency.

Findings

01

EMO matches standard MoE performance when fully activated.

02

Retaining only 25% of the experts causes only a 1% performance drop.

03

Expert subsets in EMO specialize at semantic, domain-level distinctions.

Abstract

Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts models (MoEs) seemingly offer an alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity (the independent use and composition of expert subsets) without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while…
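The shared-pool idea can be illustrated with a toy sketch. The abstract is truncated before it says how the pool is chosen, so everything below is an assumption made for illustration, not the paper's method: the pool is taken to be the experts with the highest mean router probability over the document's tokens, and the names `document_pool` and `route_within_pool` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def document_pool(router_logits, pool_size):
    """Pick one shared expert pool for a whole document.

    Assumed rule (not from the source): keep the `pool_size` experts
    with the highest mean router probability across the document's tokens.
    """
    probs = [softmax(row) for row in router_logits]
    n_experts = len(router_logits[0])
    mean_prob = [sum(p[e] for p in probs) / len(probs) for e in range(n_experts)]
    ranked = sorted(range(n_experts), key=lambda e: mean_prob[e], reverse=True)
    return sorted(ranked[:pool_size])


def route_within_pool(router_logits, pool, top_k):
    """Per-token top-k routing, restricted to experts in the shared pool."""
    routes = []
    for row in router_logits:
        chosen = sorted(pool, key=lambda e: row[e], reverse=True)[:top_k]
        weights = softmax([row[e] for e in chosen])  # renormalize over chosen experts
        routes.append(list(zip(chosen, weights)))
    return routes


# Toy document: 3 tokens, 8 experts (router logits are made-up numbers).
logits = [
    [2.0, 0.1, 1.5, -1.0, 0.0, 0.3, -0.5, 0.2],
    [1.8, 0.0, 1.9, -0.8, 0.1, 0.2, -0.4, 2.0],
    [2.2, 0.2, 1.1, -1.2, 0.0, 0.4, -0.6, 0.1],
]

pool = document_pool(logits, pool_size=2)  # 2 of 8 experts, i.e. 25%
routes = route_within_pool(logits, pool, top_k=1)
print("shared pool:", pool)
print("per-token experts:", [[e for e, _ in r] for r in routes])
```

In this toy input the second token would pick expert 7 under unrestricted top-1 routing, but the document-level pool forces it onto an expert that the rest of the document also uses. That is the intuition behind tokens from the same document sharing experts, shown here under the stated assumptions only.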

Code & Models

Repositories

allenai/EMO (GitHub)
