Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading

Hanfei Yu; Xingqi Cui; Hong Zhang; Hao Wang; Hao Wang

arXiv:2502.05370·cs.LG·October 7, 2025


Open Access

TL;DR

FineMoE is a fine-grained expert offloading system for serving MoE-based LLMs. It uses expert selection patterns and input semantics to guide expert prefetching and offloading, reducing inference latency by 47% and improving expert hit rate by 39% while remaining memory-efficient.

Contribution

The paper presents FineMoE, a system that improves MoE serving efficiency through fine-grained expert offloading, guided by expert selection patterns and input semantics.

Findings

01 Reduces inference latency by 47%

02 Improves expert hit rate by 39%

03 Demonstrates effectiveness on open-source models and real workloads

Abstract

Large Language Models (LLMs) have gained immense success in revolutionizing various applications, including content generation, search and recommendation, and AI-assisted operation. To reduce high training costs, the Mixture-of-Experts (MoE) architecture has become a popular backbone for modern LLMs. However, despite these benefits, serving MoE-based LLMs suffers from severe memory inefficiency due to sparsely activated experts. Recent studies propose offloading inactive experts from GPU memory to CPU memory to improve the serving efficiency of MoE models. However, they incur either high inference latency or high model memory footprints due to coarse-grained designs. To tame the latency-memory trade-off in MoE serving, we present FineMoE, a fine-grained expert offloading system for MoE serving that achieves low inference latency with memory efficiency. We design FineMoE to extract…
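To make the latency-memory trade-off described in the abstract concrete, the following is a minimal sketch of a generic expert cache, not FineMoE's design. It assumes a fixed number of GPU-resident expert slots, least-recently-used (LRU) offloading, and a synthetic routing trace; the names ExpertCache, prefetch, access, and replay are hypothetical and do not come from the paper. A miss stands in for a blocking CPU-to-GPU expert copy, so a higher hit rate loosely corresponds to lower latency, while a larger capacity corresponds to a larger GPU memory footprint.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy model of GPU-resident expert slots with LRU offloading to CPU.

    Hypothetical illustration only; this is not FineMoE's policy.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity        # experts that fit in GPU memory (assumed)
        self.resident = OrderedDict()   # (layer, expert) keys, oldest first
        self.hits = 0
        self.misses = 0

    def _load(self, key):
        # Make an expert resident, offloading the least recently used one if full.
        if key in self.resident:
            self.resident.move_to_end(key)
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)
        self.resident[key] = True

    def prefetch(self, key):
        # Speculative load ahead of use; counted as neither a hit nor a miss.
        self._load(key)

    def access(self, key):
        # On-demand use by the router; a miss models a blocking CPU-to-GPU copy.
        if key in self.resident:
            self.hits += 1
        else:
            self.misses += 1
        self._load(key)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def replay(trace, capacity):
    """Replay a routing trace with no prefetching and return the hit rate."""
    cache = ExpertCache(capacity)
    for key in trace:
        cache.access(key)
    return cache.hit_rate()


# Synthetic (layer, expert) routing trace, invented for illustration.
trace = [(0, 1), (0, 2), (0, 1), (0, 3), (0, 1), (0, 2)]
small = replay(trace, capacity=2)   # fewer GPU slots, more misses
large = replay(trace, capacity=3)   # more GPU slots, fewer misses
print(f"hit rate with 2 slots: {small:.2f}, with 3 slots: {large:.2f}")
```

In this toy setting, adding GPU slots raises the hit rate at the cost of memory, and calling prefetch for an expert before the router requests it turns what would have been a miss into a hit without adding slots. How FineMoE actually chooses which experts to prefetch and offload is not shown here.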


Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Distributed Sensor Networks and Detection Algorithms · Anomaly Detection Techniques and Applications

Methods: Mixture of Experts