EMO: Frustratingly Easy Progressive Training of Extendable MoE

Linghao Jin; Chufan Shi; Huijuan Wang; Nuan Wen; Zhengzhong Liu; Eric Xing; Xuezhe Ma

arXiv:2605.13247·cs.LG·May 15, 2026

TL;DR

EMO is a progressive training method for Mixture-of-Experts (MoE) models that grows the expert pool over the course of training, improving training efficiency without sacrificing performance.

Contribution

The paper proposes a simple framework that grows the expert pool during training, targeting the memory and communication costs that make large-scale MoE training inefficient.

Findings

01. EMO matches the performance of fixed-expert training in large-scale experiments.

02. EMO reduces training time and GPU costs.

03. Progressive expert expansion lets EMO scale MoE training effectively.
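
To make the idea of progressive expansion concrete, the following is a minimal sketch of a stage schedule that maps the number of training tokens seen so far to the expert-pool size in effect. The function name `experts_at`, the stage sizes, and the token budgets are all hypothetical and chosen for illustration only; they are not the schedule or the budgets derived in the paper.

```python
from bisect import bisect_right
from itertools import accumulate


def experts_at(tokens_seen, stage_experts, stage_token_budgets):
    """Return the expert-pool size in effect after `tokens_seen` training tokens.

    stage_experts:       expert count used in each stage, e.g. [8, 16, 32]
    stage_token_budgets: tokens allotted to each stage (same length)
    Both lists are illustrative assumptions, not values from the paper.
    """
    if len(stage_experts) != len(stage_token_budgets):
        raise ValueError("need one token budget per stage")
    if not stage_experts:
        raise ValueError("need at least one stage")

    # Cumulative token count at which each stage ends.
    boundaries = list(accumulate(stage_token_budgets))

    # Stage i covers tokens in [boundaries[i-1], boundaries[i]).
    stage = bisect_right(boundaries, tokens_seen)

    # Past the final boundary, stay at the largest pool.
    stage = min(stage, len(stage_experts) - 1)
    return stage_experts[stage]


if __name__ == "__main__":
    experts = [8, 16, 32]        # hypothetical pool sizes per stage
    budgets = [100, 200, 400]    # hypothetical tokens per stage
    for t in (0, 99, 100, 299, 300, 1000):
        print(t, experts_at(t, experts, budgets))
```

A training loop could call such a function at each step and trigger an expansion whenever the returned size exceeds the current pool. How the newly added experts and router entries are initialized is a separate design question that this sketch does not address.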

Abstract

Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without increasing compute, as per-token FLOPs depend only on k active experts rather than the total pool of E experts. Yet, this asymmetry creates an MoE efficiency paradox in practice: adding more experts balloons memory and communication costs, making actual training inefficient. We argue that this bottleneck arises in part because current MoE training allocates too many experts from the beginning, even though early-stage data may not fully utilize such capacity. Motivated by this, we propose EMO, a simple progressive training framework that treats MoE capacity as expandable memory and grows the expert pool over the course of training. EMO explicitly models sparsity in scaling law to derive stage-wise compute-optimal token budgets for progressive expansion. Empirical results show that EMO matches the…
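
The asymmetry between k and E in the first sentence of the abstract can be checked with simple arithmetic. The sketch below is an illustration under stated assumptions, not the paper's cost model: each expert is assumed to be a plain two-matrix feed-forward block, the router a single linear map, and the helper name `moe_layer_costs` and all dimensions are hypothetical.

```python
def moe_layer_costs(d_model, d_ff, num_experts, top_k):
    """Return (active_params_per_token, total_params) for one MoE FFN layer.

    Assumptions (not from the paper): each expert holds two weight
    matrices, d_model x d_ff and d_ff x d_model, and the router is one
    d_model x num_experts linear map. Biases are ignored.
    """
    params_per_expert = 2 * d_model * d_ff
    router_params = d_model * num_experts

    # Only the top_k selected experts (plus the router) touch each token.
    active = top_k * params_per_expert + router_params

    # Every expert must still be held in memory, whether or not it is used.
    total = num_experts * params_per_expert + router_params
    return active, total


if __name__ == "__main__":
    for num_experts in (8, 32, 128):
        active, total = moe_layer_costs(
            d_model=1024, d_ff=4096, num_experts=num_experts, top_k=2
        )
        # Per-token matmul FLOPs are roughly 2 * active (one multiply-add
        # per active weight).
        print(num_experts, active, total, round(total / active, 1))
```

Under these assumptions the active parameter count changes only through the small router term as the pool grows, while the total parameter count scales almost linearly with the number of experts. That gap is the memory side of the efficiency paradox the abstract describes; communication cost is not modeled here.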
