HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference

Peng Tang; Jiacheng Liu; Xiaofeng Hou; Yifei Pu; Jing Wang; Pheng-Ann Heng; Chao Li; Minyi Guo

arXiv:2411.01433 · cs.LG · November 7, 2024 · 2 cites

TL;DR

HOBBIT is a system that uses mixed precision techniques and dynamic expert offloading to significantly speed up MoE model inference on edge devices without sacrificing accuracy.

Contribution

HOBBIT introduces a novel hierarchical expert offloading approach with dynamic loading, prefetching, and caching, enabling efficient MoE inference on memory-constrained devices.

Findings

01 Achieves up to 9.93x decoding speedup
02 Reduces expert-loading latency significantly
03 Maintains model accuracy with low precision experts

Abstract

The Mixture-of-Experts (MoE) architecture has demonstrated significant advantages in the era of Large Language Models (LLMs), offering enhanced capabilities with reduced inference costs. However, deploying MoE-based LLMs on memory-constrained edge devices remains challenging due to their substantial memory requirements. While existing expert-offloading methods alleviate the memory requirements, they often incur significant expert-loading costs or compromise model accuracy. We present HOBBIT, a mixed-precision expert offloading system to enable flexible and efficient MoE inference. Our key insight is that dynamically replacing less critical cache-miss experts with low-precision versions can substantially reduce expert-loading latency while preserving model accuracy. HOBBIT introduces three innovative techniques that map the natural hierarchy of MoE computation: (1) a token-level dynamic…
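The key insight in the abstract (swap a less critical cache-miss expert for a low-precision copy) can be illustrated with a toy sketch. This is not the paper's implementation: the use of the router (gate) weight as the criticality signal, the `threshold` value, the 16-bit and 4-bit widths, and every name below (`ExpertCache`, `plan_expert_loads`, `transfer_bytes`) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ExpertCache:
    """Toy record of which experts are resident in fast memory.

    Entries are (expert_id, precision) pairs; hypothetical, for illustration.
    """
    resident: set = field(default_factory=set)

    def has(self, expert_id, precision="high"):
        return (expert_id, precision) in self.resident


def plan_expert_loads(gate_weights, cache, threshold=0.3):
    """Decide what to fetch for the experts one token was routed to.

    gate_weights: dict mapping expert_id -> router weight for this token.
    threshold: assumed cutoff; experts weighted below it are treated as
               "less critical" and fetched in low precision on a cache miss.
    Returns a dict mapping expert_id -> "hit", "load_high", or "load_low".
    """
    plan = {}
    for expert_id, weight in gate_weights.items():
        if cache.has(expert_id, "high"):
            plan[expert_id] = "hit"        # already resident, nothing to move
        elif weight < threshold:
            plan[expert_id] = "load_low"   # cache miss, less critical expert
        else:
            plan[expert_id] = "load_high"  # cache miss, critical expert
    return plan


def transfer_bytes(plan, params_per_expert, high_bits=16, low_bits=4):
    """Bytes moved for a plan, assuming size scales linearly with bit width."""
    bits = {"hit": 0, "load_high": high_bits, "load_low": low_bits}
    return sum(params_per_expert * bits[action] // 8 for action in plan.values())


# Usage: expert 0 is cached; experts 3 and 5 miss, and 5 has a small weight.
cache = ExpertCache({(0, "high")})
plan = plan_expert_loads({0: 0.5, 3: 0.4, 5: 0.1}, cache)
moved = transfer_bytes(plan, params_per_expert=1000)
```

Under these assumptions the low-weight expert moves a quarter of the bytes a full-precision load would, which is where the reduction in expert-loading latency would come from; how well accuracy is preserved depends on the real criticality measure and quantization scheme, which this sketch does not model.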

Taxonomy

Topics: Scientific Computing and Data Management · Service-Oriented Architecture and Web Services

Methods: Mixture of Experts