OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference

Liujianfu Wang; Yuyang Du; Yuchen Pan; Soung Chang Liew; Jiacheng Liu; Kexin Chen

arXiv:2512.03927·cs.DC·December 4, 2025

TL;DR

OD-MoE introduces a fully on-demand expert loading system for edge MoE inference, eliminating expert caches, reducing memory usage, and maintaining high accuracy and speed, enabling deployment on low-memory edge devices.

Contribution

The paper proposes an on-demand expert loading framework that predicts expert activations ahead of time, improving memory efficiency and inference speed for MoE deployment on edge devices.

Findings

01 Achieves 99.94% expert activation prediction accuracy.

02 Delivers roughly 75% of the speed of a GPU-cached MoE deployment.

03 Uses only one third of the GPU memory used by the GPU-cached baseline, enabling deployment on devices with less than 1 GB of memory.

Abstract

Mixture-of-Experts (MoE), while offering significant advantages as a Large Language Model (LLM) architecture, faces substantial challenges when deployed on low-cost edge devices with tight memory constraints. Expert offloading mitigates this issue by storing expert parameters in CPU memory and caching a subset of popular experts in GPU memory. Although this approach improves GPU memory utilization by caching only the likely-used experts, the GPU memory reserved for expert caching is underutilized compared with dense LLMs. This paper presents OD-MoE, a distributed MoE inference framework that obviates the need for expert caches via fully on-demand expert loading. OD-MoE is built upon two key mechanisms: 1) parallelizing expert loading and expert computation across distributed edge nodes, and 2) an ultra-accurate emulative predictor that forecasts expert activations multiple layers ahead…
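The two mechanisms named in the abstract (overlapping expert loading with expert computation, and predicting expert activations several layers ahead) can be illustrated with a toy sketch. This is not the paper's implementation: a local thread pool stands in for the distributed edge nodes, and `load_expert`, `run_on_demand`, `routed`, `predicted`, and `lookahead` are hypothetical names chosen for illustration. The sketch only shows the control flow: experts forecast for upcoming layers are loaded in the background, mispredicted experts are loaded synchronously, and nothing is kept once a layer finishes, so no expert cache exists.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for illustration; none of these names come from the paper.
ExpertKey = Tuple[int, int]  # (layer index, expert index)


def load_expert(key: ExpertKey) -> Callable[[float], float]:
    """Pretend to copy one expert's weights from host memory to the device."""
    layer, expert = key
    return lambda x: x + (layer + 1) * (expert + 1)  # toy "expert" function


def run_on_demand(
    x: float,
    routed: List[List[int]],     # experts the router actually selects, per layer
    predicted: List[List[int]],  # experts the predictor forecasts, per layer
    lookahead: int = 2,
) -> Tuple[float, int]:
    """Run every layer, prefetching predicted experts `lookahead` layers ahead.

    Returns the final activation and the number of mispredicted experts that
    had to be loaded synchronously.
    """
    num_layers = len(routed)
    in_flight: Dict[ExpertKey, Future] = {}
    misses = 0

    with ThreadPoolExecutor(max_workers=4) as pool:

        def prefetch(layer: int) -> None:
            # Start background loads for the experts forecast for `layer`.
            if layer < num_layers:
                for e in predicted[layer]:
                    key = (layer, e)
                    if key not in in_flight:
                        in_flight[key] = pool.submit(load_expert, key)

        # Warm-up: start loading the first `lookahead` layers before computing.
        for layer in range(min(lookahead, num_layers)):
            prefetch(layer)

        for layer in range(num_layers):
            # Loading for a later layer overlaps with computing this one.
            prefetch(layer + lookahead)
            outputs = []
            for e in routed[layer]:
                fut = in_flight.pop((layer, e), None)
                if fut is None:
                    misses += 1  # misprediction: load on the critical path
                    expert = load_expert((layer, e))
                else:
                    expert = fut.result()
                outputs.append(expert(x))
            x = sum(outputs) / len(outputs)
            # Discard predicted-but-unused experts; no expert is retained.
            for key in [k for k in in_flight if k[0] == layer]:
                in_flight.pop(key)

    return x, misses
```

Calling `run_on_demand(1.0, routed, predicted)` with `predicted` equal to `routed` yields zero synchronous loads; each wrong forecast adds one. In a real system the cost of such a miss is a stalled layer, which is why prediction accuracy matters so much for a cacheless design.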

Taxonomy

Topics: Advanced Neural Network Applications · IoT and Edge/Fog Computing · Mobile Crowdsensing and Crowdsourcing