Slicing and Dicing: Configuring Optimal Mixtures of Experts

Margaret Li; Sneha Kudugunta; Danielle Rothermel; Luke Zettlemoyer

arXiv:2605.11689·cs.LG·May 13, 2026

TL;DR

This paper systematically studies mixture-of-experts architectures in large language models, revealing that optimizing expert count and granularity is most impactful, while other design choices have minimal effects.

Contribution

First comprehensive analysis of over 2,000 pretraining runs exploring interactions of MoE design choices, highlighting the importance of expert count and granularity.

Findings

01

Performance improves with more MoE parameters across scales.

02

Optimal expert size depends only on active parameter count.

03

Dropless routing consistently improves performance.

Abstract

Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms. We find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128. Further, the optimal expert size is nearly…
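To make the quantities in the abstract concrete, here is a minimal sketch of how total expert parameters, active expert parameters, and their ratio relate for a single MoE layer. It is an illustration under stated assumptions, not the paper's own accounting: the class `MoELayerConfig`, its field names, the example dimensions, and the plain two-matrix feed-forward expert (no gating matrix, no biases, no shared expert) are all hypothetical, and the ratio is taken here to mean total expert parameters divided by the expert parameters active per token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayerConfig:
    """Hypothetical description of one MoE layer with equally sized experts."""

    d_model: int      # hidden width of the transformer
    d_expert: int     # inner dimension of each expert FFN (granularity knob)
    num_experts: int  # total routed experts in the layer
    top_k: int        # experts activated per token

    def params_per_expert(self) -> int:
        # Assumption: each expert is an up projection plus a down projection,
        # with no biases and no gating matrix.
        return 2 * self.d_model * self.d_expert

    def total_expert_params(self) -> int:
        return self.num_experts * self.params_per_expert()

    def active_expert_params(self) -> int:
        return self.top_k * self.params_per_expert()

    def total_to_active_ratio(self) -> float:
        # For equally sized experts this reduces to num_experts / top_k.
        return self.total_expert_params() / self.active_expert_params()


# Two layers with the same total and active expert parameters but different
# granularity: a few large experts versus many small ones.
coarse = MoELayerConfig(d_model=1024, d_expert=4096, num_experts=128, top_k=1)
fine = MoELayerConfig(d_model=1024, d_expert=1024, num_experts=512, top_k=4)

for name, cfg in (("coarse", coarse), ("fine", fine)):
    print(
        name,
        cfg.total_expert_params(),
        cfg.active_expert_params(),
        cfg.total_to_active_ratio(),
    )
```

Under these assumptions both configurations hold the same total and active expert parameter budgets, with a total-to-active ratio of 128, and differ only in how finely that budget is sliced into experts. That is the kind of comparison the abstract describes when it varies total experts and expert dimension together.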
