Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Junzhuo Li; Peijie Jiang; Changxin Tian; Jia Liu; Zhiqiang Zhang; Xuming Hu

arXiv:2603.10379 · cs.LG · March 12, 2026

TL;DR

This paper extends neural scaling laws to Mixture-of-Experts models, deriving an explicit formula for the optimal expert-attention compute ratio, with the aim of improving model efficiency and performance.

Contribution

It introduces a power-law relationship for the expert-attention ratio in MoE models, enabling precise control and optimization of compute allocation based on total compute and sparsity.

Findings

01. The optimal ratio $r^*$ follows a power law with total compute.

02. An explicit formula for $r^*$ enables better model tuning.

03. Guidelines for efficient MoE model design.
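
To make the first finding concrete, here is a minimal sketch of how a power law between total compute and an optimal ratio could be fitted. It assumes the simplest possible form, $r^* \approx a \cdot C^{b}$, and fits it by linear regression in log-log space. The function names, the constants `a` and `b`, and the data points are all hypothetical placeholders: this page does not give the paper's actual formula, coefficients, or its dependence on sparsity, so none of the numbers below are results from the paper.

```python
import numpy as np

def fit_power_law(compute, ratio):
    """Fit ratio ~ a * compute**b by least squares in log-log space.

    Hypothetical helper: the functional form is an assumption, not the
    paper's formula (which also depends on sparsity).
    """
    log_c = np.log(np.asarray(compute, dtype=float))
    log_r = np.log(np.asarray(ratio, dtype=float))
    # polyfit returns coefficients from highest degree down: slope, then intercept.
    b, log_a = np.polyfit(log_c, log_r, deg=1)
    return float(np.exp(log_a)), float(b)

def predict_ratio(compute, a, b):
    """Evaluate the fitted power law at new compute budgets."""
    return a * np.asarray(compute, dtype=float) ** b

# Synthetic, noise-free placeholder data generated from made-up constants.
true_a, true_b = 0.5, 0.1
budgets = np.array([1e17, 1e18, 1e19, 1e20])  # total compute in FLOPs (illustrative)
ratios = true_a * budgets ** true_b

a_hat, b_hat = fit_power_law(budgets, ratios)
extrapolated = predict_ratio(1e21, a_hat, b_hat)
```

With real measurements, the inputs would be the empirically best ratio found at each compute budget; a straight line on a log-log plot of those points is the usual sign that a power law is a reasonable description.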

Abstract

This paper extends neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. Because MoE architectures scale model capacity without a proportional increase in computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio $r$ as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we find empirically that the optimal ratio $r^{*}$ follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for $r^{*}$, enabling precise control over the expert-attention…

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research