DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs

Zeyu Zhu; Gang Li; Peisong Wang; Zitao Mo; Minnan Pei; Zhuoran Song; Xiaoyao Liang; Jian Cheng

arXiv:2602.03495·cs.DC·February 4, 2026

TL;DR

DALI is a workload-aware offloading framework that dynamically assigns experts to CPU or GPU, improves prefetching accuracy, and enhances cache efficiency, enabling efficient MoE inference on resource-constrained local PCs.

Contribution

DALI introduces dynamic expert assignment, residual-based prefetching, and workload-aware cache policies tailored for MoE inference on local PCs, addressing key inefficiencies in existing offloading methods.

Findings

01

Significant speedups in prefill and decoding phases.

02

Effective expert workload prediction improves resource utilization.

03

Enhanced GPU cache hit rates lead to better inference performance.

Abstract

Mixture of Experts (MoE) architectures significantly enhance the capacity of LLMs without proportional increases in computation, but at the cost of a vast parameter size. Offloading MoE expert parameters to host memory and leveraging both CPU and GPU computation has recently emerged as a promising direction to support such models on resource-constrained local PC platforms. While promising, we notice that existing approaches mismatch the dynamic nature of expert workloads, which leads to three fundamental inefficiencies: (1) Static expert assignment causes severe CPU-GPU load imbalance, underutilizing CPU and GPU resources; (2) Existing prefetching techniques fail to accurately predict high-workload experts, leading to costly inaccurate prefetches; (3) GPU cache policies neglect workload dynamics, resulting in poor hit rates and limited effectiveness. To address these challenges, we…
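
To make inefficiency (1) concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It compares a static expert placement with a simple greedy, workload-aware placement for a single MoE layer. The cost model (`cpu_per_token`, `gpu_per_token`, `transfer`), the helper names, and the token counts are all hypothetical, and the sketch assumes the CPU and GPU run in parallel, so layer latency is the slower of the two.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    # Hypothetical cost units, chosen only for illustration.
    cpu_per_token: float = 1.0   # CPU compute cost per routed token
    gpu_per_token: float = 0.1   # GPU compute cost per routed token
    transfer: float = 4.0        # cost to copy one uncached expert to the GPU


def gpu_cost(tokens, cached, m):
    """GPU cost for one expert: compute plus a weight transfer if not cached."""
    if tokens == 0:
        return 0.0
    return tokens * m.gpu_per_token + (0.0 if cached else m.transfer)


def cpu_cost(tokens, m):
    """CPU cost for one expert: compute only, weights already sit in host memory."""
    return tokens * m.cpu_per_token


def makespan(workload, on_gpu, cached, m):
    """Layer latency, assuming CPU and GPU work in parallel."""
    gpu = sum(gpu_cost(t, e in cached, m) for e, t in workload.items() if e in on_gpu)
    cpu = sum(cpu_cost(t, m) for e, t in workload.items() if e not in on_gpu)
    return max(cpu, gpu)


def greedy_assign(workload, cached, m):
    """Place experts, heaviest first, on whichever device would finish sooner."""
    on_gpu = set()
    cpu_total = gpu_total = 0.0
    for expert, tokens in sorted(workload.items(), key=lambda kv: -kv[1]):
        if tokens == 0:
            continue
        c = cpu_total + cpu_cost(tokens, m)
        g = gpu_total + gpu_cost(tokens, expert in cached, m)
        if g <= c:
            on_gpu.add(expert)
            gpu_total = g
        else:
            cpu_total = c
    return on_gpu


model = CostModel()
workload = {0: 40, 1: 2, 2: 1, 3: 25}  # tokens routed to each expert (made up)
cached = {1, 2}                        # experts already resident on the GPU

static_gpu = set(cached)               # static policy: GPU runs only resident experts
dynamic_gpu = greedy_assign(workload, cached, model)

static_latency = makespan(workload, static_gpu, cached, model)
dynamic_latency = makespan(workload, dynamic_gpu, cached, model)
print(sorted(dynamic_gpu), static_latency, dynamic_latency)
```

In this toy setting the static placement leaves the two heavily loaded experts on the CPU, while the greedy placement moves them to the GPU even after paying the transfer cost. A real system would also need to account for overlapping transfers with compute, which this sketch ignores.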

Taxonomy

Topics: Parallel Computing and Optimization Techniques · IoT and Edge/Fog Computing · Cloud Computing and Resource Management