Slicing and Dicing: Configuring Optimal Mixtures of Experts
Margaret Li, Sneha Kudugunta, Danielle Rothermel, Luke Zettlemoyer

TL;DR
This paper systematically studies mixture-of-experts architectures in large language models, revealing that optimizing expert count and granularity is most impactful, while other design choices have minimal effects.
Contribution
The first systematic study of how MoE design choices interact, spanning over 2,000 pretraining runs and highlighting the importance of expert count and granularity.
Findings
Performance improves with more MoE parameters across scales.
Optimal expert size depends only on active parameter count.
Dropless routing consistently improves performance.
Abstract
Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms. We find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128. Further, the optimal expert size is nearly…
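To make the terms in the abstract concrete, the following is a minimal sketch of how total and active expert parameters relate in a single MoE layer. It is not taken from the paper: the class `MoELayerConfig`, its field names, the two-matrix feed-forward parameter count, and the reading of the "ratio" as total expert parameters divided by active expert parameters are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayerConfig:
    """Hypothetical description of one MoE feed-forward layer (illustrative only)."""
    d_model: int       # hidden size of the transformer
    d_expert: int      # intermediate size of each routed expert
    num_experts: int   # total routed experts in the layer
    top_k: int         # routed experts activated per token
    d_shared: int = 0  # intermediate size of an always-on shared expert (0 = none)

    def _ffn_params(self, d_hidden: int) -> int:
        # Assumption: each expert is an up projection plus a down projection,
        # with biases and gating matrices ignored.
        return 2 * self.d_model * d_hidden

    @property
    def total_expert_params(self) -> int:
        routed = self.num_experts * self._ffn_params(self.d_expert)
        return routed + self._ffn_params(self.d_shared)

    @property
    def active_expert_params(self) -> int:
        routed = self.top_k * self._ffn_params(self.d_expert)
        return routed + self._ffn_params(self.d_shared)

    @property
    def total_to_active_ratio(self) -> float:
        return self.total_expert_params / self.active_expert_params


# Two layers with identical active and total expert parameters but different
# granularity: a few large experts versus many small ones.
coarse = MoELayerConfig(d_model=1024, d_expert=4096, num_experts=128, top_k=1)
fine = MoELayerConfig(d_model=1024, d_expert=1024, num_experts=512, top_k=4)

for name, cfg in (("coarse", coarse), ("fine", fine)):
    print(name, cfg.active_expert_params, cfg.total_expert_params,
          cfg.total_to_active_ratio)
```

Under these assumptions, both configurations activate the same number of expert parameters per token and share a total-to-active ratio of 128, so they differ only in granularity, which is the kind of axis the study varies.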
