Mixture of Diverse Size Experts
Manxi Sun, Wei Liu, Jian Luan, Pengzhi Gao, Bin Wang

TL;DR
This paper introduces MoDSE, a novel mixture-of-experts architecture with experts of varying sizes, improving language model performance and efficiency by better matching expert size to token needs.
Contribution
The paper proposes a new MoE design with diverse expert sizes and an expert-pair allocation strategy to balance workload and enhance model performance.
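To make the idea of differently sized experts concrete, here is a minimal sketch of a single sparse layer whose feed-forward experts have different hidden widths. It is an illustration under stated assumptions, not the authors' implementation: the hidden sizes, the top-2 routing with renormalized gates, the ReLU activation, and every function name (make_expert, modse_layer, and so on) are hypothetical choices made for this example.

```python
import math
import random

def make_expert(d_model, d_hidden, rng):
    """Two-layer feed-forward expert; d_hidden is this expert's own size."""
    scale = 1.0 / math.sqrt(d_model)
    w_in = [[rng.gauss(0.0, scale) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0.0, scale) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out

def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def expert_forward(x, expert):
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(x, w_in)]  # ReLU (assumed activation)
    return matvec(hidden, w_out)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def modse_layer(x, router_w, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    probs = softmax(matvec(x, router_w))
    chosen = sorted(range(len(experts)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)  # renormalize gates over chosen experts
    out = [0.0] * len(x)
    for e in chosen:
        y = expert_forward(x, experts[e])
        gate = probs[e] / norm
        out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen

rng = random.Random(0)
d_model = 8
hidden_sizes = [8, 8, 16, 16, 24, 24, 32, 32]  # hypothetical sizes, not from the paper
experts = [make_expert(d_model, h, rng) for h in hidden_sizes]
router_w = [[rng.gauss(0.0, 1.0) for _ in hidden_sizes] for _ in range(d_model)]
token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
output, chosen = modse_layer(token, router_w, experts)
```

Every expert maps back to d_model, so the router can mix outputs from experts of any width; only the hidden dimension, and therefore the compute per token, differs between experts.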
Findings
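On difficult token generation tasks, experts of various sizes achieve better predictions than same-size experts.
The routing paths of the experts tend to become stable after a period of training.
Diverse expert sizes can cause uneven workloads; the expert-pair allocation strategy balances them to keep GPU utilization efficient.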
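The workload issue can be illustrated with a small sketch. This summary does not spell out the details of the expert-pair allocation strategy, so the code below shows only one plausible reading, stated here as an assumption: pair the smallest remaining expert with the largest remaining one, so that each pair (for example, each device hosting one pair) carries a similar total size. The function names and the size list are hypothetical.

```python
def pair_experts_by_size(hidden_sizes):
    """Pair smallest with largest expert (assumed heuristic, not from the paper)."""
    order = sorted(range(len(hidden_sizes)), key=lambda i: hidden_sizes[i])
    pairs = []
    lo, hi = 0, len(order) - 1
    while lo < hi:
        pairs.append((order[lo], order[hi]))
        lo += 1
        hi -= 1
    if lo == hi:
        # Odd number of experts: the median-sized expert stays alone.
        pairs.append((order[lo],))
    return pairs

def pair_loads(hidden_sizes, pairs):
    """Total hidden size carried by each pair, a rough proxy for its workload."""
    return [sum(hidden_sizes[i] for i in group) for group in pairs]

hidden_sizes = [8, 32, 16, 24, 12, 28, 20, 20]  # hypothetical expert sizes
pairs = pair_experts_by_size(hidden_sizes)
loads = pair_loads(hidden_sizes, pairs)
print(pairs)  # [(0, 1), (4, 5), (2, 3), (6, 7)]
print(loads)  # [40, 40, 40, 40]
```

With these example sizes every pair sums to the same total, whereas placing experts on devices in index order would leave some devices with far more parameters than others.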
Abstract
The Sparsely-Activated Mixture-of-Experts (MoE) has gained increasing popularity for scaling up large language models (LLMs) without exploding computational costs. Despite its success, the current design faces a challenge where all experts have the same size, limiting the ability of tokens to choose the experts with the most appropriate size for generating the next token. In this paper, we propose the Mixture of Diverse Size Experts (MoDSE), a new MoE architecture with layers designed to have experts of different sizes. Our analysis of difficult token generation tasks shows that experts of various sizes achieve better predictions, and the routing path of the experts tends to be stable after a training period. However, having experts of diverse sizes can lead to uneven workload distribution. To tackle this limitation, we introduce an expert-pair allocation strategy to evenly distribute…
Taxonomy
Topics: Survey Sampling and Estimation Techniques · Bayesian Methods and Mixture Models · SARS-CoV-2 detection and testing
Methods: Mixture of Experts
