A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems

Qi Wu; Chao Fang; Jiayuan Chen; Ye Lin; Yueqi Zhang; Yichuan Bai; Yuan Du; Li Du

arXiv:2601.03992·cs.DC·January 8, 2026

TL;DR

This paper introduces a scheduling framework for Mixture-of-Experts (MoE) inference on edge GPU-NDP systems. It addresses load imbalance across NDP units, low GPU utilization, and the pre-profiling cost of expert pre-fetching, achieving up to a 2.56x end-to-end latency speedup.

Contribution

The paper proposes an inference framework that combines tensor parallelism, load-balancing scheduling, and dataset-free pre-fetching, tailored for edge GPU-NDP systems.

Findings

1. Achieves up to 2.56x end-to-end latency speedup
2. Effectively balances load across NDP units and GPU
3. Reduces data pre-profiling overhead significantly

Abstract

Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing (NDP) capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching. To address these challenges, this paper proposes an efficient inference framework featuring three key optimizations. First, the underexplored tensor parallelism in MoE inference is exploited to partition and compute large expert…
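The first challenge and the tensor-parallel idea named in the abstract can be illustrated with a toy load model. The sketch below is not the paper's scheduler. It assumes a hypothetical router with skewed expert popularity, assumes that expert parallelism places whole experts on NDP units round-robin, and assumes that tensor parallelism slices every expert evenly across all units while ignoring any communication or reduction cost. All function names and parameters are made up for illustration.

```python
import random
from collections import Counter


def route_tokens(num_tokens, num_experts, top_k, skew, seed=0):
    """Toy router (assumption): each token picks top_k distinct experts,
    with expert e drawn with weight 1 / (e + 1) ** skew."""
    rng = random.Random(seed)
    weights = [1.0 / (e + 1) ** skew for e in range(num_experts)]
    counts = Counter()
    for _ in range(num_tokens):
        chosen = set()
        while len(chosen) < top_k:
            chosen.add(rng.choices(range(num_experts), weights=weights)[0])
        counts.update(chosen)
    # Number of token-expert assignments received by each expert.
    return [counts[e] for e in range(num_experts)]


def expert_parallel_load(expert_counts, num_units):
    """Whole experts are placed round-robin on units (assumed placement);
    a unit's load is the work of the experts it hosts."""
    load = [0.0] * num_units
    for expert, count in enumerate(expert_counts):
        load[expert % num_units] += count
    return load


def tensor_parallel_load(expert_counts, num_units):
    """Every expert is sliced evenly across all units, so each unit
    performs 1 / num_units of every expert's work."""
    total = sum(expert_counts)
    return [total / num_units] * num_units


def imbalance(load):
    """Ratio of the busiest unit's load to the mean load (1.0 is perfect)."""
    mean = sum(load) / len(load)
    return max(load) / mean


counts = route_tokens(num_tokens=256, num_experts=8, top_k=2, skew=1.0)
ep = expert_parallel_load(counts, num_units=4)
tp = tensor_parallel_load(counts, num_units=4)
print("expert-parallel load per unit:", ep, "imbalance:", imbalance(ep))
print("tensor-parallel load per unit:", tp, "imbalance:", imbalance(tp))
```

In this simplified model the tensor-parallel imbalance is always 1.0, because per-unit work no longer depends on which experts the router selects. A real system would also pay for splitting inputs and combining partial results, which this sketch leaves out.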

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Parallel Computing and Optimization Techniques