FloE: On-the-Fly MoE Inference on Memory-constrained GPU
Yuxin Zhou, Zheng Li, Jun Zhang, Jue Wang, Yiping Wang, Zhongle Xie, Ke Chen, Lidan Shou

TL;DR
FloE is an on-the-fly MoE inference system that compresses expert parameters to cut data movement and memory usage, enabling deployment on memory-limited GPUs with a significant speedup and minimal performance loss.
Contribution
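FloE takes a compression-based approach to MoE inference on memory-constrained GPUs: it compresses the parameter matrices inside each expert and pairs this with low-cost sparse prediction, reducing the PCIe transfer load that makes on-demand expert offloading slow.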
Findings
Achieves 9.3x parameter compression per expert.
Enables deployment on a GPU with 11GB of VRAM, reducing the memory footprint by 8.5x.
Provides 48.7x inference speedup with minimal performance degradation.
Abstract
With the widespread adoption of Mixture-of-Experts (MoE) models, there is a growing demand for efficient inference on memory-constrained devices. While offloading expert parameters to CPU memory and loading activated experts on demand has emerged as a potential solution, the large size of activated experts overburdens the limited PCIe bandwidth, hindering the effectiveness in latency-sensitive scenarios. To mitigate this, we propose FloE, an on-the-fly MoE inference system on memory-constrained GPUs. FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. It employs various compression techniques on the expert's internal parameter matrices to reduce the data movement load, combined with low-cost sparse prediction, achieving perceptible inference acceleration in wall-clock time on resource-constrained devices. Empirically, FloE…
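The bandwidth argument in the abstract can be made concrete with a back-of-envelope estimate. The sketch below is illustrative only and is not taken from the paper: the expert shape (hidden size 4096, intermediate size 14336, three FP16 matrices, as in a Mixtral-8x7B-style expert) and the 32 GB/s PCIe figure (roughly the theoretical peak of PCIe 4.0 x16) are assumptions, and the helper names are hypothetical. Only the 9.3x compression ratio comes from the findings listed above.

```python
# Rough estimate of how long it takes to move one expert over PCIe,
# with and without compression. All sizes and bandwidths are assumptions
# chosen for illustration, not measurements from the paper.

def expert_bytes(hidden_size: int, intermediate_size: int,
                 bytes_per_param: int = 2, num_matrices: int = 3) -> int:
    """Size in bytes of one expert made of `num_matrices` dense matrices."""
    return num_matrices * hidden_size * intermediate_size * bytes_per_param


def transfer_ms(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds at the given bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


HIDDEN = 4096          # assumed hidden size
INTERMEDIATE = 14336   # assumed feed-forward size
PCIE_GB_PER_S = 32.0   # assumed peak bandwidth; real throughput is lower
COMPRESSION = 9.3      # per-expert compression ratio reported for FloE

full = expert_bytes(HIDDEN, INTERMEDIATE)
compressed = full / COMPRESSION

print(f"uncompressed expert: {full / 2**20:.0f} MiB, "
      f"{transfer_ms(full, PCIE_GB_PER_S):.2f} ms per load")
print(f"compressed expert:   {compressed / 2**20:.0f} MiB, "
      f"{transfer_ms(compressed, PCIE_GB_PER_S):.2f} ms per load")
```

Under these assumptions the transfer time scales linearly with the compression ratio, which is why shrinking the expert matrices directly relieves the PCIe bottleneck. The estimate ignores decompression cost, transfer overlap with compute, and any further savings from sparse prediction.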
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Big Data and Digital Economy
Methods: Mixture of Experts
