Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints
Seng Pei Liew, Kenta Shinzato, Yuyang Dong

TL;DR
This paper investigates the design principles of Mixture-of-Experts language models, revealing that total parameters and expert sparsity are key to optimal performance under memory and inference constraints.
Contribution
It introduces a principled framework for MoE architecture design focusing on maximizing total parameters and optimizing expert sparsity within resource limits.
Findings
Performance is mainly determined by total parameters and expert sparsity.
A larger total number of experts slightly reduces performance, because depth and width must shrink to stay within the memory budget.
A simple design principle is proposed to optimize MoE architecture under constraints.
Abstract
Modern Mixture-of-Experts (MoE) language models are designed based on total parameters (memory footprint) and active parameters (inference cost). However, we find these two factors alone are insufficient to describe an optimal architecture. Through a systematic study, we demonstrate that MoE performance is primarily determined by total parameters and expert sparsity. Moreover, the total number of experts and the number of active experts do not "cancel out" within the sparsity ratio; instead, a larger total number of experts slightly penalizes performance by forcing a reduction in core model dimensions (depth and width) to meet memory constraints. This motivates a simple principle for MoE design which maximizes total parameters while minimizing expert sparsity (maximizing the number of active experts) and the total number of experts under the given constraints. Our findings provide a robust framework for resolving architectural…
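The design principle in the abstract can be illustrated with a rough sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `MoEConfig` fields, the simplified parameter count (attention projections plus two-matrix expert FFNs, ignoring embeddings, routers, and norms), the definition of sparsity as total experts divided by active experts, and the lexicographic tie-breaking in `pick_config`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical MoE shape; field names are illustrative only."""
    depth: int       # number of transformer layers
    width: int       # model (hidden) dimension
    expert_dim: int  # hidden size of each expert FFN
    n_experts: int   # total experts per layer
    top_k: int       # experts activated per token

    @property
    def sparsity(self) -> float:
        # Assumed definition: total experts per active expert.
        return self.n_experts / self.top_k


def attention_params(cfg: MoEConfig) -> int:
    # Q, K, V, and output projections, each width x width (assumed).
    return 4 * cfg.width ** 2


def expert_params(cfg: MoEConfig) -> int:
    # One up-projection and one down-projection per expert (assumed).
    return 2 * cfg.width * cfg.expert_dim


def total_params(cfg: MoEConfig) -> int:
    # Proxy for memory footprint: every expert is stored.
    per_layer = attention_params(cfg) + cfg.n_experts * expert_params(cfg)
    return cfg.depth * per_layer


def active_params(cfg: MoEConfig) -> int:
    # Proxy for inference cost: only top_k experts run per token.
    per_layer = attention_params(cfg) + cfg.top_k * expert_params(cfg)
    return cfg.depth * per_layer


def pick_config(candidates, max_total, max_active):
    """Return the feasible config with the most total parameters,
    breaking ties by lower sparsity, then by fewer experts."""
    feasible = [
        c for c in candidates
        if total_params(c) <= max_total and active_params(c) <= max_active
    ]
    if not feasible:
        return None
    return min(
        feasible,
        key=lambda c: (-total_params(c), c.sparsity, c.n_experts),
    )
```

Given two candidates with the same total parameters, active parameters, and sparsity, this sketch prefers the one with fewer, wider experts, which mirrors the observation that a larger expert count carries a small penalty. The strict ordering of the three criteria is a simplification; the source does not state how they trade off against each other.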
Taxonomy
Topics: Topic Modeling · Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning
