FloE: On-the-Fly MoE Inference on Memory-constrained GPU

Yuxin Zhou; Zheng Li; Jun Zhang; Jue Wang; Yiping Wang; Zhongle Xie; Ke Chen; Lidan Shou

arXiv:2505.05950·cs.LG·May 13, 2025

TL;DR

FloE is an on-the-fly MoE inference system that compresses expert parameters to cut data movement and memory usage, enabling deployment on memory-limited GPUs with a significant speedup and minimal performance loss.

Contribution

FloE takes a compression-based approach to MoE inference on GPUs, targeting the memory constraints and the PCIe transfer latency that limit expert offloading in resource-limited environments.

Findings

01

Achieves 9.3x parameter compression per expert.

02

Enables deployment on a GPU with 11 GB of VRAM, reducing the memory footprint by 8.5x.

03

Delivers a 48.7x inference speedup with minimal performance degradation.

Abstract

With the widespread adoption of Mixture-of-Experts (MoE) models, there is a growing demand for efficient inference on memory-constrained devices. While offloading expert parameters to CPU memory and loading activated experts on demand has emerged as a potential solution, the large size of activated experts overburdens the limited PCIe bandwidth, hindering the effectiveness in latency-sensitive scenarios. To mitigate this, we propose FloE, an on-the-fly MoE inference system on memory-constrained GPUs. FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. It employs various compression techniques on the expert's internal parameter matrices to reduce the data movement load, combined with low-cost sparse prediction, achieving perceptible inference acceleration in wall-clock time on resource-constrained devices. Empirically, FloE…
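The PCIe bottleneck described in the abstract can be made concrete with a rough back-of-the-envelope calculation. The sketch below is illustrative only: the expert dimensions, the three-matrix FP16 layout, and the 32 GB/s link bandwidth are assumptions chosen for the example, not values taken from the paper. Only the 9.3x per-expert compression ratio comes from the findings listed above. It estimates the ideal host-to-GPU copy time for one expert before and after compression.

```python
def expert_bytes(hidden_dim, ffn_dim, num_matrices=3, bytes_per_param=2):
    """Size in bytes of one expert's weights (assumed layout: num_matrices
    dense matrices of shape hidden_dim x ffn_dim, stored in FP16)."""
    return hidden_dim * ffn_dim * num_matrices * bytes_per_param


def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Ideal copy time in milliseconds, ignoring latency and protocol overhead."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Assumed, illustrative values (not from the paper).
HIDDEN_DIM = 4096
FFN_DIM = 14336
PCIE_GB_PER_S = 32.0  # assumed peak bandwidth; real throughput is lower

# Per-expert compression ratio reported in the findings.
COMPRESSION = 9.3

dense = expert_bytes(HIDDEN_DIM, FFN_DIM)
compressed = dense / COMPRESSION

dense_ms = transfer_ms(dense, PCIE_GB_PER_S)
compressed_ms = transfer_ms(compressed, PCIE_GB_PER_S)

print(f"dense expert:      {dense / 2**20:.0f} MiB, {dense_ms:.2f} ms per load")
print(f"compressed expert: {compressed / 2**20:.0f} MiB, {compressed_ms:.2f} ms per load")
```

Because the ideal copy time is proportional to the number of bytes moved, the per-load transfer time in this model shrinks by the same factor as the expert size. The end-to-end speedup of a real system also depends on factors this sketch leaves out, such as decompression cost, how well expert activations are predicted, and how much transfer overlaps with compute.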

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Big Data and Digital Economy

Methods: Mixture of Experts