Fast MoE Inference via Predictive Prefetching and Expert Replication
Ankit Jyothish, Ali Jannesari, Aishwarya Sarkar, Joseph Zuber

TL;DR
This paper introduces a dynamic expert replication method for MoE models that predicts and replicates overloaded experts, significantly enhancing GPU utilization and inference speed in large language models.
Contribution
The paper presents a dynamic expert replication strategy that improves parallelism and reduces latency in MoE inference while retaining roughly 90-95% of baseline model performance.
Findings
Achieves near 100% GPU utilization during inference.
Provides up to 3x speedup in inference time.
Maintains approximately 90-95% of baseline model performance.
Abstract
The Mixture of Experts (MoE) architecture has become a fundamental building block in state-of-the-art large language models (LLMs), improving domain-specific expertise and scaling model capacity without a proportional increase in computational overhead. However, MoE inference often suffers from suboptimal GPU utilization, load imbalance, and elevated latency: because expert activation is sparse, many tokens can be routed to the same few experts and must wait on them for computation. To address these challenges, we propose a dynamic expert replication strategy that predicts which experts are likely to be overloaded and replicates them for upcoming batches of tokens. The replicated experts process batch tokens concurrently across layers, which leads to improved parallelism, reduced GPU idle time, and significantly faster inference. Experimental evaluations conducted on…
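The predict-then-replicate idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not describe the predictor, so the example assumes a simple average of recent per-batch routing counts, a hypothetical per-expert `capacity` threshold, and round-robin dispatch across replicas. All function names (`predict_overloaded`, `plan_replicas`, `dispatch`) are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def predict_overloaded(history, num_experts, capacity):
    """Estimate next-batch load per expert as the mean token count over
    recent batches (an assumed heuristic, not the paper's predictor) and
    return only the experts whose estimate exceeds `capacity`."""
    totals = Counter()
    for batch in history:
        totals.update(batch)  # each batch is a list of expert ids, one per token
    n = max(len(history), 1)
    return {
        e: totals[e] / n
        for e in range(num_experts)
        if totals[e] / n > capacity
    }


def plan_replicas(predicted_load, capacity):
    """Give each overloaded expert enough copies to bring the expected
    per-copy load down to at most `capacity` tokens."""
    return {e: math.ceil(load / capacity) for e, load in predicted_load.items()}


def dispatch(assignments, replicas):
    """Spread the tokens routed to each expert round-robin over its copies.
    Returns a mapping (expert_id, replica_id) -> list of token indices."""
    next_slot = Counter()
    queues = defaultdict(list)
    for token_idx, expert in enumerate(assignments):
        replica = next_slot[expert] % replicas.get(expert, 1)
        next_slot[expert] += 1
        queues[(expert, replica)].append(token_idx)
    return dict(queues)


# Toy routing history: expert 0 receives 3 of 4 tokens in every batch.
history = [[0, 0, 0, 1], [0, 0, 2, 0], [0, 0, 0, 3]]
predicted = predict_overloaded(history, num_experts=4, capacity=2)
replicas = plan_replicas(predicted, capacity=2)
queues = dispatch([0, 0, 0, 1], replicas)
longest_queue = max(len(tokens) for tokens in queues.values())
```

In this toy run, expert 0 is predicted to be overloaded and receives two copies, so the longest per-copy queue for the next batch drops from three tokens to two. A real system would also have to account for the memory cost of each replica and for placing copies on specific devices, which this sketch ignores.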
