TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading

Yudong Pan; Yintao He; Tianhua Han; Lian Liu; Shixin Zhao; Zhirong Chen; Mengdi Wang; Cangyuan Li; Yinhe Han; Ying Wang

arXiv:2603.01058 · cs.AR · March 3, 2026

Open Access

TL;DR

TriMoE introduces a hybrid GPU-CPU-NDP architecture with bottleneck-aware expert scheduling to deploy large MoE models efficiently, achieving up to a 2.83x speedup over existing solutions by addressing memory and compute bottlenecks.

Contribution

It proposes a novel architecture and scheduling strategies that optimize expert placement across heterogeneous compute units for MoE inference.

Findings

01. Achieves up to 2.83x speedup over existing solutions.

02. Effectively maps experts to compute units based on their memory and compute characteristics.

03. Demonstrates improved efficiency in large-scale MoE inference.

Abstract

To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU-CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures utilize DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: there exists a significant subset of warm experts that is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU-CPU-NDP architecture that fills this gap by synergistically leveraging an AMX-enabled CPU to precisely map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE…
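
The three-way split described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's scheduling policy: the function name `classify_experts`, the device labels, and both thresholds are hypothetical, and the hot-to-GPU, warm-to-CPU, cold-to-NDP assignment is one reading of the abstract. The sketch simply buckets experts by how many tokens the router sent to each in one batch.

```python
from collections import Counter


def classify_experts(routed_expert_ids, num_experts, hot_threshold, warm_threshold):
    """Bucket experts by token count (illustrative only, not TriMoE's policy).

    routed_expert_ids: one expert id per routed token in the batch.
    hot_threshold / warm_threshold: hypothetical cut-offs, hot > warm.
    """
    counts = Counter(routed_expert_ids)  # missing experts count as 0
    placement = {}
    for expert_id in range(num_experts):
        tokens = counts[expert_id]
        if tokens >= hot_threshold:
            placement[expert_id] = "gpu"        # hot: many tokens, compute-bound
        elif tokens >= warm_threshold:
            placement[expert_id] = "cpu_amx"    # warm: assumed to suit the AMX CPU
        else:
            placement[expert_id] = "dimm_ndp"   # cold: few tokens, memory-bound
    return placement


# Toy batch: expert 0 gets 8 tokens, expert 1 gets 3, expert 2 gets 1, expert 3 none.
routed = [0] * 8 + [1] * 3 + [2]
placement = classify_experts(routed, num_experts=4, hot_threshold=8, warm_threshold=3)
print(placement)
```

A static threshold like this ignores the load on each device; the abstract's bottleneck-aware scheduling and prediction-driven rebalancing address that, and their details are not reproduced here.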

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Stochastic Gradient Optimization Techniques