OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
Liujianfu Wang, Yuyang Du, Yuchen Pan, Soung Chang Liew, Jiacheng Liu, Kexin Chen

TL;DR
OD-MoE is a fully on-demand expert loading system for edge MoE inference. By eliminating expert caches, it reduces GPU memory usage while keeping expert-activation prediction highly accurate and retaining most of the speed of cached inference, which enables deployment on low-memory edge devices.
Contribution
The paper proposes an on-demand expert loading framework driven by predictive expert activation, improving memory efficiency and inference speed for edge MoE deployment.
Findings
Achieves 99.94% expert activation prediction accuracy.
Delivers roughly 75% of the speed of a GPU-cached MoE deployment.
Uses only one-third of the GPU memory of that cached baseline, enabling deployment on devices with less than 1 GB of GPU memory.
Abstract
Mixture-of-Experts (MoE), while offering significant advantages as a Large Language Model (LLM) architecture, faces substantial challenges when deployed on low-cost edge devices with tight memory constraints. Expert offloading mitigates this issue by storing expert parameters in CPU memory and caching a subset of popular experts in GPU memory. Although this approach improves GPU memory utilization by caching only the likely-used experts, the GPU memory reserved for expert caching is underutilized compared with dense LLMs. This paper presents OD-MoE, a distributed MoE inference framework that obviates the need for expert caches via fully on-demand expert loading. OD-MoE is built upon two key mechanisms: 1) parallelizing expert loading and expert computation across distributed edge nodes, and 2) an ultra-accurate emulative predictor that forecasts expert activations multiple layers ahead…
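The two mechanisms named in the abstract can be illustrated with a minimal single-process sketch: while one layer's expert is computing, the experts predicted for later layers are already being loaded in the background, and nothing is kept once it has been used. This is not the authors' implementation. Threads stand in for distributed edge nodes, a dictionary stands in for host memory, and `HOST_STORE`, `route`, `oracle_predictor`, `load_expert`, `run_expert`, `decode_token`, and `LOOKAHEAD` are all hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import time

NUM_LAYERS = 4
NUM_EXPERTS = 8
LOOKAHEAD = 2  # assumption: how many layers ahead experts are prefetched

# Stand-in for expert weights kept outside GPU memory (one float per expert).
HOST_STORE = {
    (layer, expert): float(layer * NUM_EXPERTS + expert + 1)
    for layer in range(NUM_LAYERS)
    for expert in range(NUM_EXPERTS)
}


def route(layer, token):
    """Toy router: the expert that is actually activated at this layer."""
    return (token + layer) % NUM_EXPERTS


def oracle_predictor(layer, token):
    """Hypothetical predictor that always agrees with the router."""
    return route(layer, token)


def load_expert(layer, expert):
    """Simulate a slow transfer of one expert's weights to the device."""
    time.sleep(0.01)
    return HOST_STORE[(layer, expert)]


def run_expert(weight, x):
    """Simulate expert computation on the loaded weights."""
    time.sleep(0.01)
    return x + weight


def decode_token(token, predictor=oracle_predictor, x=0.0):
    """Run one token through all layers with prefetching and no expert cache."""
    pending = {}  # layer -> (predicted expert, in-flight load)
    with ThreadPoolExecutor(max_workers=LOOKAHEAD + 1) as pool:
        # Start loads for the first LOOKAHEAD layers before any compute.
        for layer in range(min(LOOKAHEAD, NUM_LAYERS)):
            expert = predictor(layer, token)
            pending[layer] = (expert, pool.submit(load_expert, layer, expert))

        for layer in range(NUM_LAYERS):
            # Kick off the load for a later layer; it overlaps with this
            # layer's computation below.
            ahead = layer + LOOKAHEAD
            if ahead < NUM_LAYERS:
                expert = predictor(ahead, token)
                pending[ahead] = (expert, pool.submit(load_expert, ahead, expert))

            predicted, future = pending.pop(layer)
            actual = route(layer, token)
            weight = future.result()
            if predicted != actual:
                # Misprediction: discard the prefetched expert and load
                # the correct one synchronously.
                weight = load_expert(layer, actual)

            x = run_expert(weight, x)
            # `weight` is dropped here; no expert is cached between layers.
    return x


if __name__ == "__main__":
    print(decode_token(3))
```

In this toy setup a misprediction costs one extra synchronous load on the critical path, which is why a scheme of this kind depends on a very accurate predictor.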
Taxonomy
Topics: Advanced Neural Network Applications · IoT and Edge/Fog Computing · Mobile Crowdsensing and Crowdsourcing
