DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference

Yujie Zhang; Shivam Aggarwal; Tulika Mitra

arXiv:2501.10375·cs.DC·May 6, 2025

TL;DR

DAOP is a novel on-device MoE inference engine that dynamically allocates experts between CPU and GPU, using predictive pre-calculation to reduce data transfer latency and improve efficiency on memory-constrained devices.

Contribution

The paper introduces dynamic expert allocation and predictive pre-calculation mechanisms for efficient MoE inference on resource-limited devices.

Findings

01. DAOP outperforms traditional caching and prefetching methods by up to 8.20x.

02. DAOP achieves 1.35x better performance than offloading techniques.

03. It maintains model accuracy through a graceful degradation mechanism.

Abstract

Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks, face significant deployment challenges on memory-constrained devices. While GPUs offer fast inference, their limited memory compared to CPUs means not all experts can be stored on the GPU simultaneously, necessitating frequent, costly data transfers from CPU memory, often negating GPU speed advantages. To address this, we present DAOP, an on-device MoE inference engine to optimize parallel GPU-CPU execution. DAOP dynamically allocates experts between CPU and GPU based on per-sequence activation patterns, and selectively pre-calculates predicted experts on CPUs to minimize transfer latency. This approach enables efficient resource utilization across various expert cache ratios while maintaining model accuracy through a novel graceful degradation mechanism. Comprehensive evaluations across various…
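The allocation idea in the abstract, placing experts on the GPU or CPU according to how often a given sequence activates them, can be illustrated with a small sketch. This is not the authors' implementation: the function names `count_activations` and `allocate_experts`, the frequency-ranking rule, and the meaning of `cache_ratio` as the fraction of experts that fit on the GPU are all assumptions made for illustration. The predictive pre-calculation and graceful degradation steps are not modeled here.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def count_activations(routed: Iterable[Iterable[int]]) -> Counter:
    """Count how often each expert is selected across one sequence's tokens.

    `routed` holds, per token, the expert indices chosen by the router
    (hypothetical input format, assumed for this sketch).
    """
    counts: Counter = Counter()
    for token_experts in routed:
        counts.update(token_experts)
    return counts


def allocate_experts(
    counts: Counter, num_experts: int, cache_ratio: float
) -> Tuple[Set[int], Set[int]]:
    """Split experts into a GPU set and a CPU set for one sequence.

    Assumed rule: the most frequently activated experts go to the GPU,
    up to `cache_ratio` of all experts; the rest stay on the CPU.
    Ties are broken by the lower expert index.
    """
    gpu_slots = int(num_experts * cache_ratio)
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return set(ranked[:gpu_slots]), set(ranked[gpu_slots:])


# Toy sequence: 4 tokens, top-2 routing over 8 experts (made-up data).
routed = [[0, 3], [3, 5], [3, 0], [1, 3]]
gpu, cpu = allocate_experts(count_activations(routed), num_experts=8, cache_ratio=0.25)
print("GPU experts:", sorted(gpu))
print("CPU experts:", sorted(cpu))
```

In this toy run, experts 3 and 0 are activated most often, so they take the two GPU slots and the remaining six stay on the CPU. A real engine would also need to account for transfer cost and for activation patterns that shift during decoding.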

Code & Models

Repositories

ecolab-nus/DAOP · PyTorch · Official

Taxonomy

Topics: Digital Radiography and Breast Imaging · IoT and Edge/Fog Computing · Data Quality and Management

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Mixture of Experts