Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems

En-Ming Huang; Li-Shang Lin; Chun-Yi Lee

arXiv:2512.16473·cs.DC·December 19, 2025

Open Access

TL;DR

This paper introduces a CPU-GPU collaborative inference framework for MoE-based LLMs that caches experts on the GPU and handles cache misses with multithreaded CPU computation, reducing data transfer and speeding up inference on memory-limited consumer hardware.

Contribution

The paper proposes a GPU-resident expert caching mechanism and a CPU-GPU collaboration strategy that computes cache-missed experts on the CPU, enabling efficient inference of MoE models on memory-limited systems.

Findings

01. Performance improvements demonstrated in evaluations

02. Reduced data transfer through expert caching

03. Enhanced inference speed on consumer hardware

Abstract

Large Language Models (LLMs) have achieved impressive results across various tasks, yet their high computational demands pose deployment challenges, especially on consumer-grade hardware. Mixture of Experts (MoE) models provide an efficient solution through selective activation of parameter subsets, which reduces computation requirements. Despite this efficiency, state-of-the-art MoE models still require substantial memory beyond typical consumer GPU capacities. Traditional offloading methods that transfer model weights between CPU and GPU introduce latency, limiting inference performance. This paper presents a novel CPU-GPU collaborative inference framework that incorporates an expert caching mechanism on the GPU to reduce data transfer requirements and enable faster inference through cache hits. Computations are offloaded to CPU for efficient cache miss handling, which benefits from…
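
The hit/miss dispatch described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: the names `ExpertCache`, `dispatch`, and `simulate` are hypothetical, the LRU eviction policy is an assumption (the abstract does not name a policy), and admitting a missed expert into the cache after CPU execution is likewise assumed. No real weights or devices are involved; the code only tracks which device would run each routed expert.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy model of a fixed-size GPU expert cache (LRU policy is an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._resident = OrderedDict()  # expert_id -> placeholder for GPU weights
        self.hits = 0
        self.misses = 0

    def dispatch(self, expert_id):
        """Return the device that would run this expert for the current token."""
        if expert_id in self._resident:
            # Cache hit: weights are already on the GPU, so no transfer is needed.
            self._resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return "gpu"
        # Cache miss: run the expert on the CPU instead of waiting for a transfer.
        self.misses += 1
        # Assumption: the missed expert is then admitted, evicting the LRU entry.
        self._resident[expert_id] = True
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)
        return "cpu"


def simulate(routing_trace, capacity):
    """Replay a sequence of routed expert ids and report device choices."""
    cache = ExpertCache(capacity)
    devices = [cache.dispatch(e) for e in routing_trace]
    return devices, cache.hits, cache.misses


# Hypothetical routing trace: expert ids chosen by the router, one per step.
devices, hits, misses = simulate([0, 1, 0, 1, 2, 0], capacity=2)
print(devices, hits, misses)
```

Replaying a recorded routing trace through a model like this gives a rough hit rate for a given cache size, which is the quantity that determines how often the slower CPU path is taken.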

Taxonomy

Topics: Big Data and Digital Economy · Topic Modeling · Explainable Artificial Intelligence (XAI)