HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo

TL;DR
HOBBIT is a system that uses mixed precision techniques and dynamic expert offloading to significantly speed up MoE model inference on edge devices without sacrificing accuracy.
Contribution
HOBBIT introduces a novel hierarchical expert offloading approach with dynamic loading, prefetching, and caching, enabling efficient MoE inference on memory-constrained devices.
Findings
Achieves up to 9.93x decoding speedup
Reduces expert-loading latency significantly
Maintains model accuracy with low precision experts
Abstract
The Mixture-of-Experts (MoE) architecture has demonstrated significant advantages in the era of Large Language Models (LLMs), offering enhanced capabilities with reduced inference costs. However, deploying MoE-based LLMs on memory-constrained edge devices remains challenging due to their substantial memory requirements. While existing expert-offloading methods alleviate the memory requirements, they often incur significant expert-loading costs or compromise model accuracy. We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference. Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency while preserving model accuracy. HOBBIT introduces three innovative techniques that map the natural hierarchy of MoE computation: (1) a token-level dynamic…
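The key insight in the abstract (serving less critical cache-miss experts from a low precision copy) can be illustrated with a minimal sketch. It is not HOBBIT's implementation: the use of the router's gate weight as the criticality signal, the `threshold` value, and the names `ExpertCache` and `plan_expert_loads` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExpertCache:
    """Toy record of which expert IDs are already resident in fast memory (hypothetical)."""
    resident: set = field(default_factory=set)


def plan_expert_loads(gate_weights, cache, threshold=0.3):
    """Decide how to serve each expert selected by the router for one token.

    gate_weights: dict mapping expert id -> normalized router weight.
    threshold: hypothetical cutoff; a cache-miss expert whose weight falls
               below it is treated as less critical (an assumption).
    """
    plan = {}
    for expert_id, weight in gate_weights.items():
        if expert_id in cache.resident:
            # Already in fast memory: no load needed.
            plan[expert_id] = "cache_hit"
        elif weight >= threshold:
            # Important cache miss: pay the full loading cost.
            plan[expert_id] = "load_high_precision"
        else:
            # Less critical cache miss: load the smaller low precision copy.
            plan[expert_id] = "load_low_precision"
    return plan


cache = ExpertCache(resident={0})
plan = plan_expert_loads({0: 0.5, 3: 0.4, 5: 0.1}, cache)
print(plan)
```

In this toy run, expert 0 is a cache hit, expert 3 is loaded at high precision, and expert 5 is served from the low precision copy. The low precision copy is smaller, so it takes less time to transfer, which is where the reduction in expert-loading latency would come from.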
Taxonomy
Topics: Scientific Computing and Data Management · Service-Oriented Architecture and Web Services
Methods: Mixture of Experts
