Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design
Junzhuo Li, Peijie Jiang, Changxin Tian, Jia Liu, Zhiqiang Zhang, Xuming Hu

TL;DR
This paper extends neural scaling laws to Mixture-of-Experts (MoE) models, deriving an explicit formula for the optimal expert-attention compute ratio, which can be used to improve model efficiency and performance.
Contribution
It introduces a power-law relationship for the expert-attention ratio in MoE models, enabling precise control and optimization of compute allocation based on total compute and sparsity.
Findings
The optimal ratio $r^*$ follows a power law in total compute and varies with sparsity.
An explicit formula for $r^*$ enables precise control over the expert-attention compute allocation.
The results provide guidelines for efficient MoE model design.
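To illustrate what a power-law relationship between $r^*$ and total compute means in practice, here is a minimal sketch of fitting $r^* \approx a \cdot C^{b}$ by least squares in log-log space. The function names, the coefficients `true_a` and `true_b`, and the compute budgets are all made up for illustration; they are not values from the paper, and the sketch ignores the dependence on sparsity that the paper reports.

```python
import math


def fit_power_law(compute, ratio):
    """Fit ratio ~ a * compute**b with ordinary least squares on logs.

    Taking logs turns the power law into a line:
        log(ratio) = log(a) + b * log(compute)
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(r) for r in ratio]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope = power-law exponent
    a = math.exp(mean_y - b * mean_x)    # intercept = log of the prefactor
    return a, b


def predict_ratio(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** b


# Hypothetical, noise-free synthetic data (NOT results from the paper).
true_a, true_b = 0.5, 0.05
budgets = [1e18, 1e19, 1e20, 1e21]   # assumed total-compute budgets in FLOPs
ratios = [true_a * c ** true_b for c in budgets]

a_hat, b_hat = fit_power_law(budgets, ratios)
extrapolated = predict_ratio(a_hat, b_hat, 1e22)
```

With real measurements, each entry in `ratios` would be the empirically best expert-attention ratio found at that budget, and a separate fit (or an extra term) would be needed for each sparsity level.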
Abstract
This paper presents a novel extension of neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. As MoE architectures have emerged as an efficient method for scaling model capacity without proportionally increasing computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we empirically find that the optimal ratio follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for $r^*$, enabling precise control over the expert-attention…
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research
