Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

Baihui Liu; Kaiyuan Tian; Wei Wang; Zhaoning Zhang; Linbo Qiao; Dongsheng Li

arXiv:2604.08133·cs.LG·April 10, 2026

TL;DR

Alloc-MoE introduces a unified framework for optimizing expert activation distribution in Mixture-of-Experts models, significantly reducing inference latency while preserving performance, especially in resource-limited settings.

Contribution

The paper introduces the concept of an activation budget, a constraint on the number of expert activations, and a method that allocates this budget jointly at the layer and token levels to minimize performance degradation.

Findings

1. Achieves a 1.15x prefill speedup on DeepSeek-V2-Lite.

2. Achieves a 1.34x decode speedup while maintaining model performance.

3. Maintains model accuracy under constrained expert activation budgets.

Abstract

Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to its sparse activation mechanism. However, the substantial number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations can lead to severe model performance degradation. In this work, we introduce the concept of an activation budget as a constraint on the number of expert activations and propose Alloc-MoE, a unified framework that coordinates budget allocation at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose…
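
The layer-level idea, allocating a fixed activation budget across layers with dynamic programming, can be illustrated with a minimal sketch. This is not the paper's Alloc-L implementation: the abstract does not give the formulation, so the sketch assumes a hypothetical sensitivity table `cost[l][k-1]` (estimated degradation when layer `l` activates `k` experts per token), assumes the degradations add up across layers, and treats the budget as the total number of per-token expert activations summed over layers. The function name `allocate_budget` and the numbers are illustrative only.

```python
from typing import List, Sequence


def allocate_budget(cost: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Pick experts-per-token for each layer so the total equals `budget`.

    cost[l][k-1] is an ASSUMED sensitivity profile: the estimated degradation
    when layer l activates k experts per token (k = 1..len(cost[l])).
    Degradations are assumed to be additive across layers.
    """
    num_layers = len(cost)
    inf = float("inf")
    # best[l][b]: minimal total degradation for the first l layers using exactly b activations
    best = [[inf] * (budget + 1) for _ in range(num_layers + 1)]
    # choice[l][b]: the k chosen for layer l in the solution behind best[l][b]
    choice = [[0] * (budget + 1) for _ in range(num_layers + 1)]
    best[0][0] = 0.0

    for l in range(1, num_layers + 1):
        layer_cost = cost[l - 1]
        for b in range(budget + 1):
            # every layer keeps at least one active expert
            for k in range(1, min(len(layer_cost), b) + 1):
                cand = best[l - 1][b - k] + layer_cost[k - 1]
                if cand < best[l][b]:
                    best[l][b] = cand
                    choice[l][b] = k

    if best[num_layers][budget] == inf:
        raise ValueError("budget is infeasible for the given layers")

    # walk the choice table backwards to recover the per-layer allocation
    alloc: List[int] = []
    b = budget
    for l in range(num_layers, 0, -1):
        k = choice[l][b]
        alloc.append(k)
        b -= k
    alloc.reverse()
    return alloc


# Hypothetical profile for 3 layers with up to 3 experts per token each.
example_cost = [
    [5.0, 1.0, 0.0],  # layer 0: moderately sensitive
    [0.3, 0.2, 0.0],  # layer 1: tolerates few experts
    [4.0, 2.0, 0.0],  # layer 2: sensitive, benefits from the full count
]
example_alloc = allocate_budget(example_cost, budget=6)
print(example_alloc)  # [2, 1, 3]
```

With this made-up profile, a uniform split of two experts per layer would cost 1.0 + 0.2 + 2.0 = 3.2, while the dynamic program finds (2, 1, 3) with cost 1.0 + 0.3 + 0.0 = 1.3 by moving activations from the insensitive middle layer to the sensitive last layer. The table has one entry per (layer, budget) pair and each entry tries every per-layer choice, so the running time is proportional to layers × budget × maximum experts per token.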
