FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

Qingxiu Liu, Cyril Y. He, Hanser Jiang, Zion Wang, Alan Zhao, and Patrick P. C. Lee

arXiv:2604.02715·cs.LG·May 1, 2026

TL;DR

FluxMoE is an MoE inference system that decouples expert parameters from persistent GPU residency: expert weights are streamed in on demand and evicted right after use, which frees GPU memory for the KV cache and raises serving throughput.

Contribution

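FluxMoE proposes an expert paging abstraction that treats expert weights as transient, streamed resources rather than permanent GPU residents, so that GPU memory can be allocated preferentially to throughput-critical runtime state such as the KV cache during large-scale MoE inference.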
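
The paging idea can be illustrated with a toy simulation. This is a minimal sketch and not FluxMoE's implementation: the names `ExpertPager`, `materialize`, and `moe_layer` are hypothetical, Python dictionaries stand in for host and GPU memory, and each "expert" is just a short weight list. It only shows the lifecycle described above: load an expert when a token routes to it, run it, evict it immediately.

```python
from contextlib import contextmanager


class ExpertPager:
    """Toy model of expert paging (hypothetical, for illustration only).

    Weights live in a host-side store and occupy a device slot only while
    the expert is executing.
    """

    def __init__(self, host_store):
        self.host_store = host_store  # expert_id -> weights (stand-in for CPU memory)
        self.resident = {}            # expert_id -> weights (stand-in for GPU memory)
        self.peak_resident = 0        # most experts resident at the same time
        self.loads = 0                # number of simulated host-to-device transfers

    @contextmanager
    def materialize(self, expert_id):
        # Stream the expert in on demand; a real system would issue an
        # asynchronous copy here instead of a list copy.
        self.resident[expert_id] = list(self.host_store[expert_id])
        self.loads += 1
        self.peak_resident = max(self.peak_resident, len(self.resident))
        try:
            yield self.resident[expert_id]
        finally:
            # Evict immediately after use so the memory returns to the pool.
            del self.resident[expert_id]


def moe_layer(pager, tokens, routes):
    """Apply each token's routed expert (here: sum of weight * token)."""
    # Group tokens by expert so each expert is materialized once per call.
    by_expert = {}
    for idx, expert_id in enumerate(routes):
        by_expert.setdefault(expert_id, []).append(idx)

    out = [0.0] * len(tokens)
    for expert_id, idxs in by_expert.items():
        with pager.materialize(expert_id) as weights:
            for i in idxs:
                out[i] = sum(w * tokens[i] for w in weights)
    return out


host = {0: [1.0, 2.0], 1: [0.5, 0.5], 2: [3.0, 0.0]}
pager = ExpertPager(host)
result = moe_layer(pager, tokens=[1.0, 2.0, 4.0], routes=[0, 1, 0])
print(result, pager.loads, pager.peak_resident)
```

In this toy run, expert 2 is never routed to and therefore never loaded, and at most one expert is resident at any moment. The sketch ignores everything that makes the real problem hard, such as overlapping transfers with compute and the bandwidth cost of each load.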

Findings

01. Achieves up to 3.0× throughput gains over vLLM in memory-constrained scenarios.

02. Maintains model fidelity while improving inference efficiency.

03. Demonstrates effective expert weight streaming under severe memory constraints.

Abstract

Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance. In this paper, we present FluxMoE, a new MoE inference system that decouples expert parameters from persistent GPU residency. FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use, allowing GPU memory to be preferentially allocated to throughput-critical runtime state. We implement FluxMoE atop vLLM to…
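
The trade-off between resident expert weights and KV cache capacity can be made concrete with a back-of-the-envelope calculation. Every number below is an assumption chosen for illustration (GPU size, weight sizes, staging buffer, per-token KV footprint); none comes from the paper, and `kv_tokens` is a hypothetical helper.

```python
GIB = 1024 ** 3


def kv_tokens(gpu_bytes, resident_weight_bytes, kv_bytes_per_token):
    """Tokens whose KV entries fit in the memory left after resident weights."""
    free = gpu_bytes - resident_weight_bytes
    if free < 0:
        raise ValueError("resident weights exceed GPU memory")
    return free // kv_bytes_per_token


# All values are illustrative assumptions, not measurements.
gpu = 80 * GIB             # total GPU memory
non_expert = 10 * GIB      # attention and other always-resident weights
experts = 60 * GIB         # all expert weights, if kept resident
staging = 4 * GIB          # small buffer for experts being streamed in
kv_per_token = 256 * 1024  # KV cache bytes per token

all_resident = kv_tokens(gpu, non_expert + experts, kv_per_token)
paged = kv_tokens(gpu, non_expert + staging, kv_per_token)
print(all_resident, paged)  # 40960 270336
```

Under these assumed numbers, paging leaves room for roughly 6.6 times as many cached tokens. That figure describes memory capacity only; it does not predict throughput, because streaming experts also costs transfer bandwidth.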
