Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems
En-Ming Huang, Li-Shang Lin, Chun-Yi Lee

TL;DR
This paper introduces a CPU-GPU collaborative inference framework with GPU-side expert caching for MoE-based LLMs. It speeds up inference on memory-limited consumer hardware by reducing CPU-GPU data transfer and by leveraging multithreading.
Contribution
It proposes a novel expert caching mechanism and CPU-GPU collaboration strategy to enable efficient inference of MoE models on limited-memory systems.
Findings
Performance improvements demonstrated in evaluations
Reduced data transfer through expert caching
Enhanced inference speed on consumer hardware
Abstract
Large Language Models (LLMs) have achieved impressive results across various tasks, yet their high computational demands pose deployment challenges, especially on consumer-grade hardware. Mixture of Experts (MoE) models provide an efficient solution through selective activation of parameter subsets, which reduces computation requirements. Despite this efficiency, state-of-the-art MoE models still require substantial memory beyond typical consumer GPU capacities. Traditional offloading methods that transfer model weights between CPU and GPU introduce latency, limiting inference performance. This paper presents a novel CPU-GPU collaborative inference framework that incorporates an expert caching mechanism on the GPU to reduce data transfer requirements and enable faster inference through cache hits. Computations are offloaded to CPU for efficient cache miss handling, which benefits from…
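The caching idea in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not specify the eviction or admission policy, so LRU eviction and admit-on-miss are assumed here, the "GPU" and "CPU" are simulated with plain Python objects, and all names (`ExpertCache`, `moe_forward`, `run_expert`, `cpu_weights`) are hypothetical.

```python
from collections import OrderedDict


class ExpertCache:
    """Fixed-capacity cache of expert weights resident on the (simulated) GPU.

    Assumption: least-recently-used eviction. The source does not state a policy.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert_id -> weights, least recent first
        self.hits = 0
        self.misses = 0

    def lookup(self, expert_id):
        """Return cached weights on a hit, or None on a miss."""
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self.resident[expert_id]
        self.misses += 1
        return None

    def admit(self, expert_id, weights):
        """Insert an expert, evicting the least recently used one if full."""
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)
        self.resident[expert_id] = weights


def run_expert(weights, x):
    """Stand-in for an expert feed-forward block: a plain dot product."""
    return sum(w * v for w, v in zip(weights, x))


def moe_forward(x, routed_experts, cpu_weights, cache):
    """Evaluate each routed expert on the 'GPU' (hit) or the 'CPU' (miss)."""
    outputs = []
    for expert_id in routed_experts:
        weights = cache.lookup(expert_id)
        if weights is not None:
            outputs.append(("gpu", run_expert(weights, x)))
        else:
            # Miss: compute from the host copy instead of stalling on a transfer.
            outputs.append(("cpu", run_expert(cpu_weights[expert_id], x)))
            # Assumption: the missed expert is then admitted to the cache.
            cache.admit(expert_id, cpu_weights[expert_id])
    return outputs


# Usage: 8 experts kept in host memory, room for 2 on the simulated GPU.
cpu_weights = {e: [float(e)] * 4 for e in range(8)}
cache = ExpertCache(capacity=2)
x = [1.0] * 4
first = moe_forward(x, [0, 1], cpu_weights, cache)   # both experts miss
second = moe_forward(x, [1, 2], cpu_weights, cache)  # expert 1 hits, 2 misses
```

In this toy run the second token reuses expert 1 from the cache, so only expert 2 falls back to the CPU path. A real system would hold tensors on the device and could overlap the CPU computation with the weight transfer.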
Taxonomy
Topics: Big Data and Digital Economy · Topic Modeling · Explainable Artificial Intelligence (XAI)
