Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference

Euijun Chung; Yuxiao Jia; Aaron Jezghani; Hyesoon Kim

arXiv:2603.22774 · cs.AR · March 25, 2026

Open Access

TL;DR

This paper investigates how CPU limitations cause performance bottlenecks in multi-GPU large language model inference, revealing that increasing CPU resources can significantly improve throughput and latency without extra GPUs.

Contribution

The paper systematically analyzes CPU-induced slowdowns in multi-GPU LLM inference and demonstrates that adding CPU cores improves performance and stability at minimal cost.

Findings

01 CPU bottlenecks cause GPU underutilization and increased latency.

02 Adding CPU cores reduces time-to-first-token by up to 5.40x.

03 CPU provisioning is critical for optimal multi-GPU LLM inference performance.

Abstract

Large-scale machine learning workloads increasingly rely on multi-GPU systems, yet their performance is often limited by an overlooked component: the CPU. Through a detailed study of modern large language model (LLM) inference and serving workloads, we find that multi-GPU performance frequently degrades not because GPUs are saturated, but because CPUs fail to keep the GPUs busy. Under limited CPU allocations, systems exhibit symptoms such as delayed kernel launch, stalled communication, and increased tokenization latency, leading to severe GPU underutilization even when ample GPU resources are available. This work presents a systematic analysis of CPU-induced slowdowns in multi-GPU LLM inference. We show that these bottlenecks persist even in serving stacks that employ process-level separation and modern GPU-side optimizations such as CUDA Graphs. Since the marginal cost of additional…
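As a rough illustration of what a "limited CPU allocation" can look like in practice, the sketch below reports how many CPU cores the current process may run on and divides that by a GPU count. It is not taken from the paper: the helper names `available_cpu_cores` and `cores_per_gpu` are hypothetical, and the excerpt above does not say what ratio counts as under-provisioned, so the code only reports the number and applies no threshold.

```python
import os
from typing import Optional


def available_cpu_cores() -> int:
    """Return the number of CPU cores this process is allowed to run on."""
    # On Linux, sched_getaffinity reflects affinity masks (for example from
    # taskset or a cpuset). It does not account for CPU-time quotas.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity.
    return os.cpu_count() or 1


def cores_per_gpu(num_gpus: int, cores: Optional[int] = None) -> float:
    """Hypothetical helper: CPU cores available per GPU in this allocation."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if cores is None:
        cores = available_cpu_cores()
    return cores / num_gpus


if __name__ == "__main__":
    # The GPU count is an assumed input here; it is not detected automatically.
    gpus = 4
    print(f"{available_cpu_cores()} cores visible, "
          f"{cores_per_gpu(gpus):.2f} cores per GPU for {gpus} GPUs")
```

Running this inside the same job allocation or container as the serving process shows the core count that process actually sees, which can be lower than the machine total.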

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy