Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading
Hanfei Yu, Xingqi Cui, Hong Zhang, Hao Wang, Hao Wang

TL;DR
FineMoE introduces a fine-grained expert offloading system for MoE-based LLMs, significantly reducing inference latency and memory usage by intelligently guiding expert prefetching and offloading based on input and model patterns.
Contribution
The paper presents FineMoE, a system that improves MoE serving efficiency through fine-grained expert offloading, guided by expert selection patterns and input semantics.
Findings
Reduces inference latency by 47%
Improves expert hit rate by 39%
Demonstrates effectiveness on open-source models and real workloads
Abstract
Large Language Models (LLMs) have gained immense success in revolutionizing various applications, including content generation, search and recommendation, and AI-assisted operation. To reduce high training costs, the Mixture-of-Experts (MoE) architecture has become a popular backbone for modern LLMs. However, despite these benefits, serving MoE-based LLMs suffers from severe memory inefficiency due to sparsely activated experts. Recent studies propose offloading inactive experts from GPU memory to CPU memory to improve the serving efficiency of MoE models. However, they incur either high inference latency or high model memory footprints due to coarse-grained designs. To tame the latency-memory trade-off in MoE serving, we present FineMoE, a fine-grained expert offloading system for MoE serving that achieves low inference latency with memory efficiency. We design FineMoE to extract…
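To make the offloading idea in the abstract concrete, here is a minimal sketch of a GPU-resident expert cache with prefetching and a hit-rate counter. It is not FineMoE's algorithm: the class name `ExpertCache`, its methods, and the plain LRU eviction policy are all assumptions chosen for illustration. The abstract does not specify how FineMoE decides what to prefetch or evict.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy model of experts resident in GPU memory (hypothetical, LRU-based)."""

    def __init__(self, capacity):
        self.capacity = capacity        # max experts held "on GPU" at once
        self.resident = OrderedDict()   # (layer, expert) keys in LRU order
        self.hits = 0
        self.misses = 0

    def _load(self, key):
        # Mark an already-resident expert as most recently used.
        if key in self.resident:
            self.resident.move_to_end(key)
            return
        # Otherwise make room by offloading the least recently used expert.
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)
        self.resident[key] = None

    def prefetch(self, keys):
        # Load predicted experts ahead of time; does not count as an access.
        for key in keys:
            self._load(key)

    def access(self, key):
        # A miss stands for an on-demand load that would stall inference.
        if key in self.resident:
            self.hits += 1
        else:
            self.misses += 1
        self._load(key)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Usage: room for two experts, prefetch two, then access one hit and one miss.
cache = ExpertCache(capacity=2)
cache.prefetch([(0, 1), (0, 3)])
cache.access((0, 1))   # hit: expert was prefetched
cache.access((0, 2))   # miss: loaded on demand, evicting (0, 3)
```

In this toy model, a smaller `capacity` saves memory but produces more misses, and each miss corresponds to a stalled on-demand load. That is the latency-memory trade-off the abstract refers to, and better prefetch predictions raise the hit rate at a fixed capacity.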
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Distributed Sensor Networks and Detection Algorithms · Anomaly Detection Techniques and Applications
Methods: Mixture of Experts
