CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution
Muyoung Son, Yi Chen, Seungjae Yoo, Soongyu Choi, Joo-Young Kim

TL;DR
CoX-MoE is a CPU-GPU collaborative system that significantly improves MoE inference throughput by optimizing expert execution and workload orchestration using AMX-enabled hardware.
Contribution
It introduces coalesced expert execution and workload stratification techniques to enhance throughput in MoE inference on CPU-GPU systems.
Findings
Achieves up to 7.1x higher throughput than FlexGen.
Delivers up to 2.4x higher throughput than MoE-Lightning.
Effectively mitigates PCIe transfer overhead and balances workload.
Abstract
The Mixture-of-Experts (MoE) architecture improves computational efficiency via sparse expert activation, but throughput-oriented inference faces substantial GPU memory pressure due to a significant parameter size and intermediate data. Prior works attempt to mitigate this using expert offloading with micro-batching or by offloading computation to the CPU. However, the fragmented workload resulting from micro-batching degrades operational intensity, causing expert execution to become memory-bound. Meanwhile, CPU offloading is constrained by slow PCIe transfers and its limited applicability to attention computation in the decode stage. Consequently, these inefficiencies prevent effective system utilization, severely restricting the end-to-end throughput of MoE inference. To address these challenges, this paper proposes CoX-MoE, an Advanced Matrix Extensions (AMX)-enabled CPU-GPU…
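The abstract's claim that micro-batching lowers operational intensity and makes expert execution memory-bound can be illustrated with back-of-the-envelope roofline arithmetic. The sketch below is not taken from the paper. It assumes a hypothetical expert built from two dense projections (d_model to d_ff and back), 2-byte weights and activations, and weights that are read once per batch; the function names and all hardware numbers are illustrative assumptions.

```python
def expert_ffn_intensity(tokens, d_model, d_ff, bytes_per_elem=2):
    """Approximate FLOPs per byte moved for one expert on `tokens` tokens.

    Assumed (hypothetical) expert: up projection d_model -> d_ff followed by
    down projection d_ff -> d_model, with no gating matrix.
    """
    # Each matmul costs 2 * tokens * d_model * d_ff FLOPs; there are two.
    flops = 2 * (2 * tokens * d_model * d_ff)
    # Both weight matrices are read once, regardless of the token count.
    weight_bytes = 2 * d_model * d_ff * bytes_per_elem
    # Input read, intermediate written then read, output written.
    act_bytes = tokens * (2 * d_model + 2 * d_ff) * bytes_per_elem
    return flops / (weight_bytes + act_bytes)


def is_memory_bound(intensity, peak_flops, mem_bandwidth):
    """Roofline test: memory-bound when intensity is below the ridge point."""
    return intensity < peak_flops / mem_bandwidth


if __name__ == "__main__":
    d_model, d_ff = 4096, 14336          # illustrative layer sizes
    peak_flops, mem_bw = 100e12, 1e12    # hypothetical device: ridge = 100 FLOPs/byte
    for tokens in (1, 8, 64, 512):
        oi = expert_ffn_intensity(tokens, d_model, d_ff)
        bound = "memory-bound" if is_memory_bound(oi, peak_flops, mem_bw) else "compute-bound"
        print(f"{tokens:4d} tokens per expert: {oi:7.1f} FLOPs/byte ({bound})")
```

Under these assumptions the weight traffic is fixed while the FLOPs grow with the number of tokens routed to the expert, so splitting a batch into smaller micro-batches pushes each expert call toward the memory-bound side of the ridge point, while grouping more tokens per expert call pushes it toward the compute-bound side.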
