PROBE: Co-Balancing Computation and Communication in MoE Inference via Real-Time Predictive Prefetching
Qianchao Zhu, Xucheng Ye, Yuliang Liu, Haodong Ouyang, Chengru Song

TL;DR
PROBE is a real-time inference system for Mixture-of-Experts (MoE) models that predicts upcoming expert activations and prefetches ahead of them, balancing computation and communication to reduce latency and improve throughput under dynamic workloads.
Contribution
PROBE introduces a real-time predictive prefetching system that combines a lookahead predictor, dynamic expert balancing, and co-scheduling to optimize MoE inference performance.
Findings
Reduces prefill latency by up to 1.32X
Improves decoding throughput by up to 1.26X
Effective under extreme workload volatility
Abstract
Mixture-of-Experts models have become a dominant architecture for scaling Large Language Models by activating only a sparse subset of experts per token. However, latency-critical MoE inference faces a fundamental tension: while expert parallelism improves memory efficiency, it also amplifies execution stragglers. In real-world serving, continuous batching and diverse concurrent requests induce rapid semantic shifts, causing expert hotspots to migrate abruptly across GPUs and triggering the 'double penalty' of coupled computational skew and network congestion. We propose PROBE, an inference system that co-balances computation and communication in real time. PROBE introduces Continuous Lookahead Pipelining, which proactively predicts, plans, and prefetches for upcoming layers while keeping all control overheads off the critical path. PROBE consists of: (1) a Gate-Initialized Lookahead…
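To make the computational skew described in the abstract concrete, the following minimal Python sketch counts how many routed tokens land on each GPU under expert parallelism and reports a max-over-mean imbalance ratio. It is an illustration only, not PROBE's method: the function names, the contiguous expert-to-GPU placement, the top-1 routing, and the imbalance metric are all assumptions made for this example.

```python
def per_gpu_load(expert_ids, num_experts, num_gpus):
    """Count routed tokens per GPU.

    Assumption (hypothetical placement): experts are split into contiguous,
    equally sized blocks, so expert e lives on GPU e // (num_experts // num_gpus).
    """
    experts_per_gpu = num_experts // num_gpus
    load = [0] * num_gpus
    for e in expert_ids:
        load[e // experts_per_gpu] += 1
    return load


def imbalance(load):
    """Max-over-mean load; 1.0 means perfectly balanced."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean > 0 else 1.0


# 8 experts on 4 GPUs, 16 tokens with top-1 routing (illustrative data).
balanced = [0, 1, 2, 3, 4, 5, 6, 7] * 2
# A hotspot: most tokens pick experts 0 and 1, which share GPU 0.
skewed = [0] * 6 + [1] * 4 + [2, 3, 4, 5, 6, 7]

balanced_load = per_gpu_load(balanced, num_experts=8, num_gpus=4)
skewed_load = per_gpu_load(skewed, num_experts=8, num_gpus=4)

print(balanced_load, imbalance(balanced_load))
print(skewed_load, imbalance(skewed_load))
```

In the skewed case the hot GPU receives several times the average load, so every other GPU waits for it at the next synchronization point. The same hot GPU is also the destination of most dispatched tokens, which is one way to read the coupled compute and network penalty the abstract refers to.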
Taxonomy
Topics: Big Data and Digital Economy · Advanced Neural Network Applications · IoT and Edge/Fog Computing
