Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization
Weilin Wan, Jingtao Han, Weizhong Zhang, Cheng Jin

TL;DR
This paper introduces a comprehensive framework for optimizing Mixture-of-Experts architectures using holistic scaling laws, enabling precise resource allocation and flexible model design across a wide range of compute budgets.
Contribution
The paper develops a reusable, low-dimensional optimization framework that maps compute budgets to optimal MoE architectures, addressing limitations of previous scaling studies.
Findings
FLOPs per token is insufficient as a fairness metric for MoE models.
The framework produces robust scaling laws validated across hundreds of models.
Near-optimal configuration flexibility increases with model scale.
Abstract
Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architectural optimization that bridges this gap. We first show that FLOPs per token alone is an inadequate fairness metric for MoE models because differing computational densities across layer types can inflate parameters without proportional compute cost, and establish a joint constraint triad of FLOPs per token, active parameters, and total parameters. We then reduce the 16-dimensional…
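The abstract's argument that FLOPs per token alone cannot make MoE comparisons fair can be illustrated with a small parameter-counting sketch. This is not the paper's accounting. It assumes a simplified layer with four d_model x d_model attention projections and two-matrix experts, ignores the router, embeddings, and normalization, and uses the common approximation of 2 FLOPs per active weight plus a parameter-free attention-score term. The class MoEConfig, its fields, and all function names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical, simplified MoE transformer configuration."""
    d_model: int    # hidden size
    n_layers: int   # number of transformer blocks
    n_experts: int  # experts per MoE layer
    top_k: int      # experts activated per token
    d_expert: int   # hidden size inside each expert
    seq_len: int    # context length used for the attention-score term


def attention_params(c: MoEConfig) -> int:
    # Assumption: Q, K, V, and output projections, each d_model x d_model.
    return 4 * c.d_model * c.d_model


def expert_params(c: MoEConfig) -> int:
    # Assumption: one up-projection and one down-projection per expert.
    return 2 * c.d_model * c.d_expert


def total_params(c: MoEConfig) -> int:
    # Every expert is stored, whether or not a token is routed to it.
    return c.n_layers * (attention_params(c) + c.n_experts * expert_params(c))


def active_params(c: MoEConfig) -> int:
    # Only the top_k routed experts contribute to a single token.
    return c.n_layers * (attention_params(c) + c.top_k * expert_params(c))


def flops_per_token(c: MoEConfig) -> int:
    # About 2 FLOPs (multiply and add) per active weight, plus the
    # attention score and weighted-sum matmuls, which use no parameters.
    score_flops = c.n_layers * 4 * c.seq_len * c.d_model
    return 2 * active_params(c) + score_flops


def triad(c: MoEConfig) -> tuple[int, int, int]:
    # The three quantities the abstract proposes to constrain jointly.
    return flops_per_token(c), active_params(c), total_params(c)


base = MoEConfig(d_model=1024, n_layers=12, n_experts=8,
                 top_k=2, d_expert=2048, seq_len=2048)
wide = replace(base, n_experts=64)  # more experts, same routing width

for name, cfg in (("base", base), ("wide", wide)):
    flops, active, total = triad(cfg)
    print(f"{name}: flops/token={flops:,} active={active:,} total={total:,}")
```

Under these assumptions, base and wide report identical FLOPs per token and identical active parameters, yet wide stores several times more total parameters. Matching on FLOPs per token alone would treat the two as equivalent, which is the kind of gap a joint constraint on all three quantities is meant to close.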
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Big Data and Digital Economy · Advanced Multi-Objective Optimization Algorithms
