CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution

Muyoung Son; Yi Chen; Seungjae Yoo; Soongyu Choi; Joo-Young Kim

arXiv:2605.17889 · cs.LG · May 20, 2026

TL;DR

CoX-MoE is a CPU-GPU collaborative system that raises MoE inference throughput by restructuring expert execution and orchestrating workloads across the GPU and an AMX-enabled CPU.

Contribution

CoX-MoE introduces two techniques, coalesced expert execution and workload stratification, to raise MoE inference throughput on CPU-GPU systems.
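The listing does not describe how coalesced expert execution works internally. The sketch below illustrates only the general idea the name suggests: instead of running each expert once per micro-batch on a handful of tokens, tokens routed to the same expert are gathered across all micro-batches so that each expert runs one larger batch. It assumes top-1 routing, and the helper names `coalesce_by_expert` and `per_expert_batch_sizes` are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def per_expert_batch_sizes(micro_batches):
    """Batch size each expert sees when every micro-batch is executed separately.

    micro_batches: list of micro-batches, each a list of (token_id, expert_id)
    pairs (top-1 routing assumed for simplicity).
    Returns {expert_id: [tokens in micro-batch 0, tokens in micro-batch 1, ...]},
    listing only micro-batches in which the expert is actually used.
    """
    sizes = defaultdict(list)
    for mb in micro_batches:
        counts = defaultdict(int)
        for _, expert_id in mb:
            counts[expert_id] += 1
        for expert_id, count in counts.items():
            sizes[expert_id].append(count)
    return dict(sizes)


def coalesce_by_expert(micro_batches):
    """Gather token ids routed to the same expert across all micro-batches.

    Returns {expert_id: [token_id, ...]} in encounter order, so each expert
    can be invoked once on a single, larger batch of tokens.
    """
    groups = defaultdict(list)
    for mb in micro_batches:
        for token_id, expert_id in mb:
            groups[expert_id].append(token_id)
    return dict(groups)


if __name__ == "__main__":
    mbs = [[(0, 1), (1, 0), (2, 1)], [(3, 1), (4, 2)], [(5, 0), (6, 1)]]
    print(per_expert_batch_sizes(mbs))  # expert 1 runs three times: 2, 1, 1 tokens
    print(coalesce_by_expert(mbs))      # expert 1 runs once on tokens 0, 2, 3, 6
```

In this toy input, expert 1 would otherwise load its weights three times to process 2, 1, and 1 tokens; after coalescing it loads them once for 4 tokens, which is the kind of larger per-expert batch that keeps expert execution from becoming memory-bound.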

Findings

01. Achieves up to 7.1x higher throughput than FlexGen.

02. Delivers up to 2.4x higher throughput than MoE-Lightning.

03. Mitigates PCIe transfer overhead and balances the workload across the CPU and GPU.

Abstract

The Mixture-of-Experts (MoE) architecture improves computational efficiency via sparse expert activation, but throughput-oriented inference faces substantial GPU memory pressure due to the large parameter footprint and intermediate data. Prior works attempt to mitigate this by offloading experts with micro-batching or by offloading computation to the CPU. However, the fragmented workload resulting from micro-batching degrades operational intensity, causing expert execution to become memory-bound. Meanwhile, CPU offloading is constrained by slow PCIe transfers and is applicable only to attention computation in the decode stage. Consequently, these inefficiencies prevent effective system utilization, severely restricting the end-to-end throughput of MoE inference. To address these challenges, this paper proposes CoX-MoE, an Advanced Matrix Extensions (AMX)-enabled CPU-GPU…
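The abstract's claim that micro-batching makes expert execution memory-bound follows from a simple roofline argument: an expert's weights must be read from memory regardless of how many tokens use them, so fewer tokens per invocation means fewer FLOPs per byte moved. The sketch below models a single expert projection of shape `d_model x d_ff` in a 2-byte format. The layer sizes and the hardware peak numbers in the usage are illustrative assumptions, not figures from the paper, and the model ignores caching and kernel overheads.

```python
def expert_intensity(tokens, d_model, d_ff, bytes_per_elem=2):
    """Operational intensity (FLOPs per byte) of one expert projection GEMM.

    Simplified model (assumption): weights are read once, each token's input
    is read once and its output written once.
    """
    flops = 2 * tokens * d_model * d_ff  # one multiply-add per weight per token
    weight_bytes = bytes_per_elem * d_model * d_ff
    act_bytes = bytes_per_elem * tokens * (d_model + d_ff)  # input read + output write
    return flops / (weight_bytes + act_bytes)


def is_memory_bound(intensity, peak_flops, mem_bw):
    """Roofline test: below the ridge point peak_flops / mem_bw, memory bandwidth limits speed."""
    return intensity < peak_flops / mem_bw


if __name__ == "__main__":
    # Illustrative sizes and hardware (assumed): 300 TFLOP/s peak, 2 TB/s bandwidth.
    peak, bw = 300e12, 2e12
    for t in (1, 16, 256):
        i = expert_intensity(t, d_model=4096, d_ff=14336)
        print(t, round(i, 1), "memory-bound" if is_memory_bound(i, peak, bw) else "compute-bound")
```

With one token per invocation the intensity is about 1 FLOP/byte, far below the assumed ridge point of 150, while 256 tokens reach roughly 237 FLOPs/byte and cross into the compute-bound regime. This is why fragmenting a batch into small micro-batches leaves each expert invocation limited by weight traffic rather than by arithmetic throughput.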