MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache

Leyang Xue; Yao Fu; Zhan Lu; Luo Mai; Mahesh Marina

arXiv:2401.14361·cs.LG·March 14, 2025·6 cites

TL;DR

MoE-Infinity introduces a sparsity-aware expert cache that accelerates MoE inference on personal machines by exploiting expert activation sparsity, improving per-token latency by 3.1-16.7x over state-of-the-art systems.

Contribution

This work presents a sparsity-aware expert cache for MoE inference on personal machines with limited GPU memory. The cache traces sparse expert activations during inference and uses selected traces to guide expert replacement and prefetching.

Findings

01 Achieves 3.1-16.7x latency improvements over state-of-the-art systems.

02 Effectively leverages expert activation sparsity during inference.

03 Demonstrates significant speedups across various MoE models and tasks.

Abstract

This paper presents MoE-Infinity, an efficient MoE inference system designed for personal machines with limited GPU memory capacity. The key idea for MoE-Infinity is that on personal machines, which are often single-user environments, MoE-based LLMs typically operate with a batch size of one. In this setting, MoE models exhibit a high degree of activation sparsity, meaning a small number of experts are frequently reused in generating tokens during the decode phase. Leveraging this idea, we design a sparsity-aware expert cache, which can trace the sparse activation of experts during inference and carefully select the trace that represents the sparsity pattern. By analyzing these selected traces, MoE-Infinity guides the replacement and prefetching of the expert cache, providing 3.1-16.7x per-token latency improvements over numerous state-of-the-art systems, including vLLM, Ollama,…
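To make the caching idea in the abstract concrete, here is a minimal Python sketch of an activation-aware expert cache. It is an illustration under stated assumptions, not MoE-Infinity's actual implementation: the class name `SparsityAwareExpertCache`, its methods, and the eviction rule (evict the resident expert with the fewest recorded activations) are all hypothetical, and the sketch omits the trace selection and prefetching that the abstract describes.

```python
from collections import Counter


class SparsityAwareExpertCache:
    """Toy expert cache (hypothetical, not the paper's design).

    Keeps up to `capacity` experts resident and, on a miss with a full
    cache, evicts the resident expert with the fewest traced activations.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.resident = set()   # (layer, expert) pairs assumed to be in GPU memory
        self.trace = Counter()  # activation count per (layer, expert)
        self.hits = 0
        self.misses = 0

    def access(self, layer, expert):
        """Record one expert activation; return True on a cache hit."""
        key = (layer, expert)
        self.trace[key] += 1
        if key in self.resident:
            self.hits += 1
            return True
        self.misses += 1
        if len(self.resident) >= self.capacity:
            # Assumption: the least-activated expert is the least likely to be
            # reused. Ties are broken by key so eviction is deterministic.
            victim = min(self.resident, key=lambda k: (self.trace[k], k))
            self.resident.remove(victim)
        self.resident.add(key)
        return False


# Usage: a cache with room for two experts and a short decode-phase trace
# in which expert (0, 1) is reused often.
cache = SparsityAwareExpertCache(capacity=2)
for layer, expert in [(0, 1), (0, 1), (0, 2), (0, 3), (0, 1)]:
    cache.access(layer, expert)
print(cache.hits, cache.misses)  # prints "2 3"
```

In this trace the frequently reused expert (0, 1) stays resident when (0, 3) arrives, whereas a plain least-recently-used policy would have evicted it. That is the kind of reuse pattern the abstract attributes to batch-size-one decoding.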

Taxonomy

Topics: Privacy-Preserving Technologies in Data · IoT and Edge/Fog Computing · Context-Aware Activity Recognition Systems