How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

Leena Chennuru Vankadara; Moritz Haas; Luke Hayward; Sebastian Bordt; Alessandro Breccia

arXiv:2605.14200·cs.LG·May 15, 2026

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

Leena Chennuru Vankadara, Moritz Haas, Luke Hayward, Sebastian Bordt, Alessandro Breccia

PDF

Abstract

Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$ , expert width $N_{e}$ , number of experts $M$ , sparsity $K$ , and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N ≍ N_{e}$ , (II) co-scaling $N ≍ M ≍ K$ , and (III) full proportional scaling of $N, N_{e}, M$ , and $K$ . For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ( $μ$ ) desiderata. We then…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.