SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference

Yongchao He; Bohan Zhao; Zheng Cao

arXiv:2506.22033 · cs.DC · June 30, 2025

TL;DR

SiPipe improves pipeline-parallel LLM inference by using underutilized CPU resources to reduce pipeline bubbles and load imbalance, achieving up to 2.1x higher throughput than vLLM and 43% lower per-token latency across a range of models and setups.

Contribution

SiPipe introduces a heterogeneous pipeline design that offloads auxiliary computation and communication to the CPU. It combines CPU sampling, a token-safe execution model, and structure-aware transmission to improve GPU utilization and throughput in multi-GPU LLM inference.

Findings

01. Up to 2.1x higher throughput compared to vLLM.

02. 43% reduction in per-token latency.

03. Up to 23% higher GPU utilization.

Abstract

As inference workloads for large language models (LLMs) scale to meet growing user demand, pipeline parallelism (PP) has become a widely adopted strategy for multi-GPU deployment, particularly in cross-node setups, to improve key-value (KV) cache capacity and inference throughput. However, PP suffers from inherent inefficiencies caused by three types of execution bubbles (load-imbalance, intra-stage, and inter-stage), which limit pipeline saturation. We present SiPipe, a heterogeneous pipeline design that improves throughput by leveraging underutilized CPU resources to offload auxiliary computation and communication. SiPipe incorporates three key techniques (CPU sampling, a token-safe execution model, and structure-aware transmission) to mitigate pipeline bubbles and improve execution efficiency. Across diverse LLMs, SiPipe achieves up to 2.1 times higher throughput, 43% lower per-token…
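
To make the load-imbalance bubble concrete, here is a minimal toy model, not the paper's implementation. It assumes a saturated pipeline in which every stage advances at the pace of the slowest stage, and it assumes the last stage carries extra sampling work. The function name stage_utilization and all timings are hypothetical and chosen only for illustration; the offload case is idealized as fully overlapped CPU work.

```python
def stage_utilization(stage_times_ms):
    """Steady-state busy fraction of each pipeline stage.

    Toy assumption: in a saturated pipeline every stage advances at the
    pace of the slowest stage, so stage i is busy t_i / max(t) of the time.
    """
    bottleneck = max(stage_times_ms)
    return [t / bottleneck for t in stage_times_ms]


# Hypothetical per-step costs for a 4-stage pipeline (not from the paper).
forward_ms = [10.0, 10.0, 10.0, 10.0]
sampling_ms = 2.5  # assumed extra sampling work on the last stage

# Baseline: the last stage runs its forward pass plus sampling on the GPU.
baseline = forward_ms[:-1] + [forward_ms[-1] + sampling_ms]

# Idealized offload: sampling runs on the CPU, fully overlapped with GPU work.
offloaded = list(forward_ms)

print(stage_utilization(baseline))   # earlier stages idle part of each step
print(stage_utilization(offloaded))  # all stages equally busy
```

In this toy setting the three earlier stages sit idle for a fifth of every step until the sampling cost is taken off the last GPU stage. Real pipelines also incur the intra-stage and inter-stage bubbles the abstract mentions, which this sketch does not model.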

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy