Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

Abhimanyu Bambhaniya; Geonhwa Jeong; Jason Park; Jiecao Yu; Jaewon Lee; Pengchao Wang; Changkyu Kim; Chunqiang Tang; Tushar Krishna

arXiv:2604.23150·cs.LG·April 28, 2026

TL;DR

This paper analyzes challenges in multi-node MoE inference for large language models, profiling expert activation patterns and proposing workload-aware strategies to reduce communication overhead and improve efficiency.

Contribution

It systematically characterizes expert activation properties and introduces workload-aware micro-batch grouping and expert placement strategies to optimize multi-node MoE inference.

Findings

01 Profiling reveals persistent expert load imbalance and domain-specific activation patterns.

02 Proposed strategies reduce inter-node communication by up to 20x.

03 Optimizations lead to lower latency and better accelerator utilization.

Abstract

Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without a proportional increase in per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing. The problem is most acute in multi-node deployments, where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-235B-A22B, on various datasets and collect over 100k real expert activation traces. Studying these expert activation patterns, we uncover several persistent properties across all the frontier MoE models: variable expert load…
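The two bottlenecks named in the abstract, expert load imbalance and cross-node token routing, can be made concrete with a small sketch. The Python below is illustrative only and is not the paper's profiling code: the function names, the toy trace, the token-to-node and expert-to-node mappings, and the choice of max-over-mean as the imbalance metric are all assumptions made for this example.

```python
from collections import Counter

def expert_load(trace, num_experts):
    """Count how many token assignments each expert receives.

    `trace` is a list with one entry per token; each entry is the tuple of
    expert ids that token was routed to (hypothetical trace format).
    """
    counts = Counter(e for token_experts in trace for e in token_experts)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance_ratio(load):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean > 0 else 0.0

def remote_fraction(trace, token_node, expert_node):
    """Fraction of (token, expert) assignments that cross a node boundary."""
    total = 0
    remote = 0
    for tok, experts in enumerate(trace):
        for e in experts:
            total += 1
            if token_node[tok] != expert_node[e]:
                remote += 1
    return remote / total if total else 0.0

# Toy example (assumed values): 4 tokens, top-2 routing over 4 experts.
trace = [(0, 1), (0, 2), (0, 3), (0, 1)]
token_node = [0, 0, 1, 1]    # node holding each token
expert_node = [0, 0, 1, 1]   # experts 0-1 on node 0, experts 2-3 on node 1

load = expert_load(trace, num_experts=4)
print(load)                                             # [4, 2, 1, 1]
print(imbalance_ratio(load))                            # 2.0
print(remote_fraction(trace, token_node, expert_node))  # 0.5
```

In this toy trace expert 0 receives twice the mean load, and half of all token-to-expert assignments would require inter-node transfer. Changing `expert_node` or regrouping tokens across nodes changes `remote_fraction`, which is the kind of quantity that placement and micro-batch grouping strategies aim to reduce.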
