Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems
Zehao Fan, Zhenyu Liu, Yunzhen Liu, Yayue Hou, Hadjer Benmeziane, Kaoutar El Maghraoui, Liu Liu

TL;DR
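This paper presents a context-aware GPU-NDP system with CXL-attached memory for Mixture-of-Experts inference. It uses prefill-stage activation statistics to pin hot experts in GPU HBM and runs the remaining experts in place on CXL-NDP with mixed-precision quantization, replacing repeated weight transfers with cheaper activation transfers and raising decoding throughput by up to 8.7x.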
Contribution
The paper introduces a context-aware MoE inference system that combines dynamic expert placement across GPU HBM and CXL-NDP with context-aware mixed-precision quantization, which is used to fit NDP's limited compute throughput.
Findings
Achieves up to 8.7x decoding throughput improvement
Only 0.13% average accuracy drop
Effectively overlaps GPU and NDP execution
Abstract
Mixture-of-Experts (MoE) models scale large language models through conditional computation, but inference becomes memory-bound once expert weights exceed the capacity of GPU memory. In this case, weights must be offloaded to external memory, and fetching them incurs costly and repeated transfers. We address this by adopting CXL-attached near-data processing (CXL-NDP) as the offloading tier to execute cold experts in place, converting expensive parameter movement into cheaper activation movement. Unlike prior GPU-NDP systems that are largely context-agnostic and reactive, we develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement, dynamically pins hot experts in GPU-side HBM, and maps the remainder to CXL-NDP. To meet NDP's limited compute throughput, we introduce context-aware mixed-precision quantization that allocates…
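The placement idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed `hbm_slots` budget, the top-k ranking rule, and the gated-FFN size formula are all assumptions made for illustration. The sketch counts how often each expert is selected during prefill, pins the most frequently used experts in HBM, maps the rest to CXL-NDP, and compares the bytes moved when fetching one expert's weights against sending one token's activations to the expert and back.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def count_expert_activations(prefill_routes: Iterable[Sequence[int]]) -> Counter:
    """Count how often each expert id was selected across prefill tokens."""
    counts: Counter = Counter()
    for token_experts in prefill_routes:
        counts.update(token_experts)
    return counts


def place_experts(counts: Counter, num_experts: int, hbm_slots: int) -> Dict[int, str]:
    """Pin the most-activated experts in HBM; map the rest to CXL-NDP.

    Assumption: a simple top-k rule with a fixed slot budget. The paper's
    actual policy may differ.
    """
    # Rank by descending activation count; break ties by expert id.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    hot = set(ranked[:hbm_slots])
    return {e: ("HBM" if e in hot else "CXL-NDP") for e in range(num_experts)}


def bytes_moved(d_model: int, d_ff: int, tokens: int,
                bytes_per_value: int = 2) -> Tuple[int, int]:
    """Return (weight_bytes, activation_bytes) for one expert.

    Assumption: a gated FFN expert with three d_model x d_ff matrices,
    and activations that travel to the expert and back once per token.
    """
    weight_bytes = 3 * d_model * d_ff * bytes_per_value
    activation_bytes = 2 * tokens * d_model * bytes_per_value
    return weight_bytes, activation_bytes


# Toy prefill trace: each tuple holds the top-2 experts chosen for one token.
prefill_routes = [(0, 3), (3, 5), (3, 0), (1, 3), (0, 5)]
counts = count_expert_activations(prefill_routes)
placement = place_experts(counts, num_experts=8, hbm_slots=2)

# Hypothetical layer dimensions, chosen only to show the size gap.
weight_bytes, activation_bytes = bytes_moved(d_model=4096, d_ff=14336, tokens=1)
```

In this toy trace, experts 3 and 0 are selected most often and are pinned in HBM, while the other six experts are mapped to CXL-NDP. Under the assumed dimensions, a single token's activations are several orders of magnitude smaller than the expert's weights, which is why executing cold experts in place can be cheaper than fetching them.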
Taxonomy
Topics: Advanced Neural Network Applications · Speech Recognition and Synthesis · Multimodal Machine Learning Applications
