Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
Abhimanyu Bambhaniya, Geonhwa Jeong, Jason Park, Jiecao Yu, Jaewon Lee, Pengchao Wang, Changkyu Kim, Chunqiang Tang, Tushar Krishna

TL;DR
This paper analyzes challenges in multi-node MoE inference for large language models, profiling expert activation patterns and proposing workload-aware strategies to reduce communication overhead and improve efficiency.
Contribution
It systematically characterizes expert activation properties and introduces workload-aware micro-batch grouping and expert placement strategies to optimize multi-node MoE inference.
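The summary does not spell out the paper's placement algorithm, so the following is only a minimal sketch of what a load-aware expert placement could look like. It swaps in a generic greedy heuristic (longest-processing-time first) rather than the authors' method: experts are sorted by observed activation count and each is assigned to the least-loaded node that still has a free slot. The names `expert_loads` and `greedy_placement` and the toy trace are hypothetical.

```python
from collections import Counter


def expert_loads(trace, num_experts):
    """Count how many tokens activated each expert.

    `trace` is an assumed format: one tuple of expert ids per token.
    """
    counts = Counter(e for token_experts in trace for e in token_experts)
    return [counts.get(e, 0) for e in range(num_experts)]


def greedy_placement(loads, num_nodes):
    """Assign experts to nodes, heaviest first, onto the least-loaded node.

    Each node holds at most ceil(num_experts / num_nodes) experts.
    This is a generic LPT heuristic, not the paper's algorithm.
    """
    capacity = -(-len(loads) // num_nodes)  # ceiling division
    node_load = [0] * num_nodes
    node_experts = [[] for _ in range(num_nodes)]
    for e in sorted(range(len(loads)), key=lambda i: -loads[i]):
        # Only nodes with a free expert slot are candidates.
        candidates = [n for n in range(num_nodes)
                      if len(node_experts[n]) < capacity]
        target = min(candidates, key=lambda n: node_load[n])
        node_experts[target].append(e)
        node_load[target] += loads[e]
    return node_experts, node_load


# Toy trace: 4 tokens, top-2 routing over 4 experts (made-up data).
trace = [(0, 1), (0, 2), (0, 3), (1, 2)]
loads = expert_loads(trace, num_experts=4)
placement, per_node = greedy_placement(loads, num_nodes=2)
print(loads, placement, per_node)
```

A heuristic like this balances load only; it ignores which experts tend to be activated together, which a workload-aware scheme would presumably also take into account.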
Findings
Profiling reveals persistent expert load imbalance and domain-specific activation patterns.
Proposed strategies reduce inter-node communication by up to 20x.
Optimizations lead to lower latency and better accelerator utilization.
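To make the two quantities behind these findings concrete, here is a small sketch of how load imbalance and inter-node traffic could be measured from an activation trace. The trace format, the max-over-mean imbalance metric, and the function names are assumptions made for illustration, not definitions taken from the paper.

```python
def load_imbalance(loads):
    """Ratio of the busiest expert's load to the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


def remote_dispatch_fraction(trace, token_node, expert_node):
    """Fraction of token-to-expert dispatches that cross a node boundary.

    trace[i]       : tuple of expert ids activated by token i (assumed format)
    token_node[i]  : node on which token i resides
    expert_node[e] : node hosting expert e
    """
    total = remote = 0
    for experts, src in zip(trace, token_node):
        for e in experts:
            total += 1
            remote += expert_node[e] != src
    return remote / total if total else 0.0


# Toy example: 4 tokens with top-2 routing, 4 experts on 2 nodes (made-up data).
trace = [(0, 1), (0, 2), (0, 3), (1, 2)]
token_node = [0, 0, 1, 1]
expert_node = [0, 1, 1, 0]  # experts 0 and 3 on node 0, experts 1 and 2 on node 1

print(load_imbalance([3, 2, 2, 1]))                               # 1.5
print(remote_dispatch_fraction(trace, token_node, expert_node))  # 0.5
```

In this toy setup half of all dispatches leave their node; regrouping tokens or relocating experts changes only `token_node` or `expert_node`, so the same function can compare alternative strategies on a fixed trace.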
Abstract
Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-235B-A22B, on various datasets and collect over 100k real expert activation traces. Upon studying the expert activation patterns, we uncover various persistent properties across all the frontier MoE models: variable expert load…
