Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

Houyi Li; Ka Man Lo; Shijie Xuyang; Ziqi Wang; Wenzhen Zheng; Haocheng Zhang; Zhao Li; Shuigeng Zhou; Xiangyu Zhang; Daxin Jiang

arXiv:2506.12119 · cs.CL · May 19, 2026

TL;DR

This paper demonstrates that properly optimized Mixture-of-Experts models can outperform dense language models when constrained to equal total parameters, compute, and data, validated by extensive experiments.

Contribution

The paper proposes a framework for finding the optimal MoE architecture and shows that the resulting models can surpass dense models when total parameters, training compute, and data are held equal.

Findings

01. MoE models with optimal activation rates outperform dense models under equal resources.

02. Optimal MoE design remains consistent across different model sizes.

03. Reusing data can mitigate the trade-off between data amount and performance.

Abstract

Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints -- that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and arrive at an optimal model design that maximizes performance. Based on this, we subsequently find that an MoE model with an activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter count, training compute, and data budget. More importantly, this…
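
To make the "strictly equal resource" comparison concrete, the sketch below works through the arithmetic for one hypothetical pair of models. It is not taken from the paper: it assumes that the activation rate means activated parameters divided by total parameters, and it uses the common rule of thumb that training compute is roughly 6 × activated parameters × training tokens. The function names and all numbers are illustrative only.

```python
def activation_rate(activated_params: float, total_params: float) -> float:
    """Fraction of the total parameters used per token.

    Assumed definition (not quoted from the paper): activated / total.
    """
    if total_params <= 0 or not (0 < activated_params <= total_params):
        raise ValueError("need 0 < activated_params <= total_params")
    return activated_params / total_params


def tokens_for_compute(compute_flops: float, activated_params: float) -> float:
    """Training tokens affordable under a FLOP budget.

    Uses the rule-of-thumb approximation C ~= 6 * N_activated * D,
    which is an assumption of this sketch.
    """
    if compute_flops <= 0 or activated_params <= 0:
        raise ValueError("compute_flops and activated_params must be positive")
    return compute_flops / (6.0 * activated_params)


# Hypothetical numbers, chosen only for illustration.
TOTAL_PARAMS = 2e9    # both models hold 2B parameters in total
DENSE_TOKENS = 40e9   # tokens seen by the dense model

# A dense model activates every parameter, so its budget is 6 * N * D.
budget = 6.0 * TOTAL_PARAMS * DENSE_TOKENS

# A hypothetical MoE that activates 0.4B of its 2B parameters per token.
moe_activated = 0.4e9
rate = activation_rate(moe_activated, TOTAL_PARAMS)
moe_tokens = tokens_for_compute(budget, moe_activated)

print(f"activation rate: {rate:.2f}")
print(f"MoE token steps at equal compute: {moe_tokens:.3g}")
```

Under these assumptions, an MoE with fewer activated parameters takes more token steps than the dense model for the same compute. That is one way to read why the summary above lists data as a trade-off and data reuse as a mitigation.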

Code & Models

Models

Hugging Face model: kamanphoebe/moe_surpass_dense

Videos

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource (SlidesLive)

Taxonomy

Topics: Imbalanced Data Classification Techniques · Machine Learning and Data Classification · Machine Learning and Algorithms