ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
Yingping Wang, Yi Wu, Xiangyu Wu, Junwei Cui, Weilin Cai, Zhijiang Guo, Jiayi Huang

TL;DR
ReaLB is a real-time load balancing method for multimodal MoE inference that dynamically adjusts expert computation precision to improve efficiency without extra memory or redundant experts.
Contribution
It introduces a runtime precision-adjustment technique that balances expert load in multimodal MoE inference systems with zero scheduling overhead.
Findings
Achieves a 1.10× to 1.32× inference speedup.
Limits accuracy degradation to within 1%.
Effectively balances expert workloads during inference.
Abstract
Mixture-of-Experts (MoE) architectures are widely used in modern large language models and multimodal models. However, inference efficiency is often limited by highly dynamic and skewed expert workloads across different modalities. During the prefill stage with large batch sizes, vision tokens frequently dominate the input sequences. Under expert parallelism (EP), this leads to severe load imbalance, where a subset of devices becomes overloaded, reducing overall system throughput. We propose ReaLB, a real-time load balancing method for multimodal MoE (MMoE) inference that introduces zero scheduling overhead. ReaLB dynamically adjusts the computation precision of MoE experts at runtime on a per-EP-rank basis. For ranks dominated by vision-heavy experts, ReaLB assigns lower-precision computation to improve execution efficiency by exploiting FP4 Tensor Cores. ReaLB does not require…
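The per-EP-rank decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `RankLoad` type, the `choose_precisions` function, the overload and vision-fraction thresholds, and the use of "bf16" as the default precision are all hypothetical choices made for illustration. Only the idea of assigning lower-precision (FP4) computation to ranks dominated by vision-heavy experts comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class RankLoad:
    """Tokens routed to the experts hosted on one EP rank (hypothetical type)."""
    vision_tokens: int
    text_tokens: int

    @property
    def total(self) -> int:
        return self.vision_tokens + self.text_tokens

    @property
    def vision_fraction(self) -> float:
        # Guard against an idle rank that received no tokens.
        return self.vision_tokens / self.total if self.total else 0.0


def choose_precisions(loads, overload_ratio=1.2, vision_threshold=0.5,
                      high="bf16", low="fp4"):
    """Return one precision label per EP rank.

    Assumed policy (not taken from the paper): a rank drops to the low
    precision only when it is both overloaded relative to the mean rank
    load and dominated by vision tokens.
    """
    if not loads:
        return []
    mean_load = sum(r.total for r in loads) / len(loads)
    decisions = []
    for r in loads:
        overloaded = r.total > overload_ratio * mean_load
        vision_heavy = r.vision_fraction > vision_threshold
        decisions.append(low if overloaded and vision_heavy else high)
    return decisions


# Example: rank 0 hosts vision-heavy experts and receives most of the tokens.
loads = [
    RankLoad(vision_tokens=9000, text_tokens=1000),
    RankLoad(vision_tokens=500, text_tokens=2500),
    RankLoad(vision_tokens=300, text_tokens=2700),
    RankLoad(vision_tokens=200, text_tokens=2800),
]
precisions = choose_precisions(loads)
print(precisions)
```

In this toy input, only the first rank is both above 1.2 times the mean load and vision-dominated, so it alone is assigned the low-precision label. Because the decision uses only per-rank token counts, it moves no experts or tokens between devices, which is in the spirit of the abstract's claim of no redundant experts; how the actual method computes and applies its decision is not specified in the text shown here.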
