A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems
Qi Wu, Chao Fang, Jiayuan Chen, Ye Lin, Yueqi Zhang, Yichuan Bai, Yuan Du, Li Du

TL;DR
This paper introduces a scheduling framework for Mixture-of-Experts (MoE) inference on edge GPU-NDP systems. It targets three problems: load imbalance across NDP units, low GPU utilization during expert computation, and the data pre-profiling needed for expert pre-fetching. It reports up to 2.56x end-to-end latency speedup.
Contribution
It proposes a novel inference framework with tensor parallelism, load-balancing scheduling, and dataset-free pre-fetching tailored for edge GPU-NDP systems.
Findings
Achieves up to 2.56x end-to-end latency speedup
Effectively balances load across NDP units and GPU
Reduces data pre-profiling overhead significantly
Abstract
Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing (NDP) capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching. To address these challenges, this paper proposes an efficient inference framework featuring three key optimizations. First, the underexplored tensor parallelism in MoE inference is exploited to partition and compute large expert…
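The first challenge in the abstract, load imbalance under expert parallelism, can be illustrated with a toy model. The sketch below is not the authors' scheduler: the function names, the `expert % num_units` placement rule, and the routing trace are all hypothetical, and the cost model counts tokens only, ignoring the communication and reduction costs that tensor parallelism incurs in practice. It only shows why pinning whole experts to units inherits the skew of expert selection, while sharding each expert across all units spreads the same work evenly.

```python
def expert_parallel_load(token_experts, num_units):
    """Tokens per unit when each whole expert lives on one unit.

    Assumed placement (hypothetical): expert e is pinned to unit e % num_units.
    """
    load = [0] * num_units
    for expert in token_experts:
        load[expert % num_units] += 1
    return load


def tensor_parallel_load(token_experts, num_units):
    """Work per unit when every expert's weights are split evenly across units.

    Each token then costs 1/num_units of an expert pass on every unit,
    regardless of which expert it was routed to.
    """
    share = len(token_experts) / num_units
    return [share] * num_units


def imbalance(load):
    """Ratio of the busiest unit's load to the mean load (1.0 = balanced)."""
    mean = sum(load) / len(load)
    return max(load) / mean


# Hypothetical routing trace: 8 tokens, 4 experts, expert 0 is "hot".
routing = [0, 0, 0, 0, 0, 1, 2, 0]

ep = expert_parallel_load(routing, num_units=4)
tp = tensor_parallel_load(routing, num_units=4)

print("expert parallel:", ep, "imbalance", imbalance(ep))
print("tensor parallel:", tp, "imbalance", imbalance(tp))
```

In this toy trace the busiest unit under expert parallelism handles three times the mean load, while the tensor-parallel split is balanced by construction. Whether that advantage survives on real hardware depends on the costs this sketch leaves out.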
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Parallel Computing and Optimization Techniques
