FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving
Qingxiu Liu, Cyril Y. He, Hanser Jiang, Zion Wang, Alan Zhao, and Patrick P. C. Lee

TL;DR
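FluxMoE is an MoE inference system that decouples expert parameters from persistent GPU residency. It streams expert weights on demand so that GPU memory can go to throughput-critical state such as the KV cache, and reports up to 3.0× higher throughput than vLLM in memory-constrained scenarios.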
Contribution
FluxMoE proposes an expert paging system that treats expert weights as transient, streamed resources, enabling more efficient GPU memory utilization during large-scale MoE inference.
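To make the paging idea concrete, here is a minimal toy sketch in plain Python. It is not FluxMoE's implementation: the names `ExpertPager`, `materialize`, `num_slots`, and `run_moe_layer` are hypothetical, a dictionary stands in for host memory, and each "expert" is reduced to a scale and a bias. It only illustrates the pattern described above, in which weights are materialized on demand and evicted immediately after use.

```python
from contextlib import contextmanager


class ExpertPager:
    """Toy model of expert paging (hypothetical, not FluxMoE's API).

    Weights live in a host-side store and occupy a "device" slot
    only while the expert is executing.
    """

    def __init__(self, host_store, num_slots):
        self.host_store = host_store  # expert_id -> weights (stand-in for host memory)
        self.num_slots = num_slots    # assumed budget of device staging slots
        self.resident = {}            # expert_id -> weights currently "on device"
        self.loads = 0                # number of simulated host-to-device copies

    @contextmanager
    def materialize(self, expert_id):
        if len(self.resident) >= self.num_slots:
            raise RuntimeError("no free staging slot")
        # Stand-in for a host-to-device copy of the expert's weights.
        self.resident[expert_id] = list(self.host_store[expert_id])
        self.loads += 1
        try:
            yield self.resident[expert_id]
        finally:
            # Evict immediately after use, freeing the slot.
            del self.resident[expert_id]


def run_moe_layer(pager, token_values, routing):
    """Apply each token's routed expert, streaming every expert in once."""
    outputs = [0.0] * len(token_values)
    # Group token indices by expert so each expert is loaded once per layer.
    by_expert = {}
    for idx, expert_id in enumerate(routing):
        by_expert.setdefault(expert_id, []).append(idx)
    for expert_id, idxs in by_expert.items():
        with pager.materialize(expert_id) as (scale, bias):
            for i in idxs:
                outputs[i] = scale * token_values[i] + bias
    return outputs


host_store = {0: (2.0, 0.0), 1: (1.0, 1.0), 2: (0.5, -1.0)}
pager = ExpertPager(host_store, num_slots=1)
out = run_moe_layer(pager, [1.0, 2.0, 3.0, 4.0], routing=[0, 1, 0, 2])
```

With a single staging slot, the layer still runs all three experts, because no expert stays resident after its tokens are processed. In this toy model the memory that permanently resident experts would occupy is what becomes available for other runtime state.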
Findings
Achieves up to 3.0× throughput gains over vLLM in memory-constrained scenarios.
Maintains model fidelity while improving inference efficiency.
Demonstrates effective expert weight streaming under severe memory constraints.
Abstract
Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance. In this paper, we present FluxMoE, a new MoE inference system that decouples expert parameters from persistent GPU residency. FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use, allowing GPU memory to be preferentially allocated to throughput-critical runtime state. We implement FluxMoE atop vLLM to…
