Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

TL;DR
This paper introduces a scaling law for Mixture-of-Experts models that predicts their efficiency based on configuration, validated by training a new model that matches larger dense models with less compute.
Contribution
The paper develops a unified scaling law for MoE efficiency, derived from empirical data on over 300 trained models, enabling better prediction of model capacity and resource use.
Findings
Efficiency Leverage (EL) is driven primarily by the expert activation ratio and the total compute budget, both following power laws.
Expert granularity has a non-linear effect with an optimal range.
A pilot MoE model matched larger dense models' performance with less compute.
Abstract
Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear…
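The abstract describes EL only as the computational advantage of an MoE model over a dense equivalent. The sketch below illustrates one plausible reading, assuming EL is the ratio of dense compute to MoE compute needed to reach the same loss, and assuming each family's loss follows a simple power law L(C) = a * C^(-b). The function names and all coefficients are hypothetical and chosen for illustration; they are not values from the paper.

```python
# Minimal sketch of an "efficiency leverage" style ratio.
# ASSUMPTIONS (not from the source): loss follows L(C) = a * C**(-b) for each
# model family, and EL = C_dense / C_moe at a matched target loss.
# All coefficients below are made up for illustration.

def compute_to_reach(loss: float, a: float, b: float) -> float:
    """Invert L(C) = a * C**(-b) to get the compute C that reaches `loss`."""
    if loss <= 0 or a <= 0 or b <= 0:
        raise ValueError("loss, a, and b must all be positive")
    return (a / loss) ** (1.0 / b)


def efficiency_leverage(target_loss: float,
                        dense_fit: tuple[float, float],
                        moe_fit: tuple[float, float]) -> float:
    """Ratio of dense compute to MoE compute at the same target loss."""
    c_dense = compute_to_reach(target_loss, *dense_fit)
    c_moe = compute_to_reach(target_loss, *moe_fit)
    return c_dense / c_moe


if __name__ == "__main__":
    dense_fit = (10.0, 0.05)  # hypothetical (a, b) for a dense family
    moe_fit = (9.0, 0.05)     # hypothetical (a, b) for an MoE family
    el = efficiency_leverage(2.5, dense_fit, moe_fit)
    print(f"EL at loss 2.5: {el:.2f}")
```

A value above 1 under this reading means the MoE configuration needs less compute than the dense baseline to reach the same loss. When the two exponents differ, the ratio depends on the target loss, which is consistent with the abstract's statement that EL varies with the compute budget.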
Peer Reviews
Decision: ICLR 2026 Poster
- The study is large in scale (20B MoE and 1T tokens), and detailed ablation studies justify the choices of architecture.
- A principled approach (scaling law) is used to tune the LR and batch size, further solidifying the argument for the proposed leverage scaling law.
- While finding the optimal configuration is practically impossible given the many hyperparameters, this paper still provides valuable and principled insights into how to scale MoE models.
- The experimental design for analyzing the Activation Ratio (A) (Section F.1, Table 8) appears structurally limited, as it holds the active computational cost nearly constant.
- The study varies the Activation Ratio (sparsity) exclusively by changing the total number of routable experts (E) while fixing the number of activated experts (Ea=2) and shared experts (Es=1). This approach means the study explores *only* the effect of increasing total parameters without changing the number of active experts.
- F
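The sweep this review describes can be illustrated with a small sketch. It assumes the activation ratio is the number of active experts (routed plus shared) divided by the total number of experts (routable plus shared); the review does not state the formula, so this definition, the function names, and the example values of E are assumptions. Only Ea=2 and Es=1 come from the review text.

```python
# Minimal sketch of an activation-ratio sweep with fixed active experts.
# ASSUMPTION (not from the source): A = (Ea + Es) / (E + Es).
# The E values used below are hypothetical; Ea=2 and Es=1 follow the review.

def activation_ratio(total_routed: int, active_routed: int = 2, shared: int = 1) -> float:
    """Fraction of experts that are active per token under the assumed definition."""
    if total_routed < active_routed:
        raise ValueError("total_routed must be at least active_routed")
    return (active_routed + shared) / (total_routed + shared)


def sweep(total_routed_values, active_routed: int = 2, shared: int = 1):
    """Vary only E; the number of active experts stays constant."""
    rows = []
    for e in total_routed_values:
        rows.append({
            "E": e,
            "active_experts": active_routed + shared,
            "A": activation_ratio(e, active_routed, shared),
        })
    return rows


if __name__ == "__main__":
    for row in sweep([8, 16, 32, 64]):
        print(row)
```

Under these assumptions the active expert count is identical in every row while A shrinks as E grows, which is the limitation the reviewer points to: the sweep changes total parameters but not per-token active compute.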
1. This work's originality stems from its formulation of "Efficiency Leverage" (EL), a clear metric to quantify MoE computational advantage. It uses this concept to build a unified scaling law for EL, connecting it to the compute budget, activation ratio, and granularity.
2. The empirical quality is high, supported by over 300 trained models. A key strength is the preliminary work in Section 2, which derives MoE-specific scaling laws for optimal hyperparameters and data allocation. This step
1. The conclusion that efficiency monotonically increases with sparsity (Key Takeaway 1) is based on a theoretical FLOPs model. This omits the practical wall-clock costs of routing, communication (e.g., all-to-all), and memory bandwidth for loading many distinct expert weights, which can become bottlenecks at high sparsity.
2. The functional forms for the scaling laws (Eq. 2, 3, 4, and 5) are presented without strong justification. It is not clear why these specific complex forms (e.g., log-pol
**Conceptual contribution:** EL has the potential to be a useful metric that simplifies efficiency comparisons between MoE and dense architectures. However, the metric needs to be precisely defined, and its usage throughout the paper must remain consistent with that definition, with any claims properly justified.

**Empirical validation:** The validation experiments successfully corroborate the paper's claims. The authors train models at significantly larger scales than those used to fit the s
**Major Issue: Imprecise definition of EL and its inconsistent use throughout the paper, lack of clarity in experimental setup**

The definition of Efficiency Leverage (lines 186-188) lacks precision, and its usage throughout the paper creates confusion for readers. Several issues arise:

1. **Unclear functional form**: The formal definition of EL makes it a function of three arguments: MoE architecture, MoE compute budget, and dense architecture. However, the authors' usage throughout the text
Code & Models
- 🤗 inclusionAI/Ling-flash-2.0 · model · 904 dl · ♡ 212
- 🤗 inclusionAI/Ling-mini-2.0 · model · 16k dl · ♡ 190
- 🤗 inclusionAI/Ling-mini-base-2.0 · model · 235 dl · ♡ 23
- 🤗 inclusionAI/Ling-mini-base-2.0-5T · model · 910 dl · ♡ 6
- 🤗 inclusionAI/Ling-mini-base-2.0-10T · model · 13 dl · ♡ 6
- 🤗 inclusionAI/Ling-mini-base-2.0-15T · model · 7 dl · ♡ 3
- 🤗 inclusionAI/Ling-mini-base-2.0-20T · model · 205 dl · ♡ 11
- 🤗 inclusionAI/Ling-flash-base-2.0 · model · 279 dl · ♡ 31
- 🤗 inclusionAI/Ling-1T · model · 2.1k dl · ♡ 532
- 🤗 inclusionAI/Ling-1T-FP8 · model · 1.5k dl · ♡ 9
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Speech and dialogue systems
