TL;DR
This paper shows that a well-designed Mixture-of-Experts (MoE) model can outperform a dense language model given the same total parameter count, training compute, and data budget, and supports the claim with extensive experiments.
Contribution
It introduces a framework for designing optimal MoE architectures and shows that the resulting models can surpass dense models when total parameters, training compute, and data are held strictly equal.
Findings
MoE models whose activation rate falls in an optimal region outperform dense counterparts with the same total parameters, training compute, and data.
Optimal MoE design remains consistent across different model sizes.
Reusing data can mitigate the trade-off between data amount and performance.
Abstract
Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints -- that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and achieve an optimal model design that maximizes the performance. Based on this, we subsequently find that an MoE model with activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter, training compute and data resource. More importantly, this…
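The equal-resource setup described in the abstract can be made concrete with some rough bookkeeping. The sketch below is illustrative only and is not taken from the paper: it assumes the common rule of thumb that training costs about 6 FLOPs per active parameter per token, and the function names and all numbers (2B total parameters, a 25% activation rate, 40B tokens) are hypothetical.

```python
def activation_rate(active_params: float, total_params: float) -> float:
    """Fraction of parameters used per token (1.0 for a dense model)."""
    if total_params <= 0 or not 0 < active_params <= total_params:
        raise ValueError("need 0 < active_params <= total_params")
    return active_params / total_params


def training_flops(active_params: float, tokens: float) -> float:
    """Assumed rule of thumb: ~6 FLOPs per active parameter per token."""
    return 6.0 * active_params * tokens


def tokens_for_budget(flops_budget: float, active_params: float) -> float:
    """Training tokens that fit in a FLOPs budget under the same rule of thumb."""
    if active_params <= 0:
        raise ValueError("active_params must be positive")
    return flops_budget / (6.0 * active_params)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
total_params = 2e9            # same total parameter count for both models
dense_active = total_params   # a dense model activates every parameter
moe_active = 0.5e9            # hypothetical MoE activating 25% per token

dense_tokens = 40e9           # hypothetical dense training run
budget = training_flops(dense_active, dense_tokens)

# Tokens the MoE must process to spend the same training compute.
moe_tokens = tokens_for_budget(budget, moe_active)

print("MoE activation rate:", activation_rate(moe_active, total_params))
print("MoE tokens / dense tokens:", moe_tokens / dense_tokens)
```

With these made-up numbers the MoE activates a quarter of its parameters per token, so matching the dense model's compute takes about four times as many training tokens. If the pool of unique data is fixed, those extra tokens would have to come from repeated passes over it.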
Taxonomy
Topics: Imbalanced Data Classification Techniques · Machine Learning and Data Classification · Machine Learning and Algorithms
