SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference
Yongchao He, Bohan Zhao, Zheng Cao

TL;DR
SiPipe improves pipeline-parallel LLM inference by offloading auxiliary computation and communication to underutilized CPUs, which reduces pipeline bubbles and load imbalance and yields up to 2.1x higher throughput and 43% lower per-token latency.
Contribution
SiPipe introduces a heterogeneous pipeline design that combines CPU sampling, a token-safe execution model, and structure-aware transmission to improve GPU utilization and throughput in multi-GPU LLM inference.
Findings
Up to 2.1x higher throughput compared to vLLM.
43% reduction in per-token latency.
Up to 23% higher GPU utilization.
Abstract
As inference workloads for large language models (LLMs) scale to meet growing user demand, pipeline parallelism (PP) has become a widely adopted strategy for multi-GPU deployment, particularly in cross-node setups, to improve key-value (KV) cache capacity and inference throughput. However, PP suffers from inherent inefficiencies caused by three types of execution bubbles (load-imbalance, intra-stage, and inter-stage), which limit pipeline saturation. We present SiPipe, a heterogeneous pipeline design that improves throughput by leveraging underutilized CPU resources to offload auxiliary computation and communication. SiPipe incorporates three key techniques (CPU sampling, a token-safe execution model, and structure-aware transmission) to mitigate pipeline bubbles and improve execution efficiency. Across diverse LLMs, SiPipe achieves up to 2.1 times higher throughput, 43% lower per-token…
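To make the load-imbalance bubble concrete, here is a minimal toy model, not SiPipe's implementation. It relies only on the general fact that a steady-state pipeline advances at the pace of its slowest stage. The stage times are hypothetical, and the premise that sampling runs on the last GPU stage and can be fully overlapped once moved to the CPU is an illustrative assumption rather than something stated above. The numbers do not reproduce the paper's results.

```python
def stage_utilization(stage_times_ms):
    """Fraction of time each stage is busy when the slowest stage sets the pace."""
    bottleneck = max(stage_times_ms)
    return [t / bottleneck for t in stage_times_ms]


def microbatches_per_second(stage_times_ms):
    """Steady-state pipeline throughput: one microbatch per bottleneck interval."""
    return 1000.0 / max(stage_times_ms)


# Hypothetical per-microbatch stage times in milliseconds (made up for illustration).
# Assumption: the last stage carries 4 ms of extra sampling work on the GPU.
baseline = [10.0, 10.0, 10.0, 14.0]
# Assumption: sampling moved to the CPU and fully overlapped with GPU work.
offloaded = [10.0, 10.0, 10.0, 10.0]

for name, times in (("baseline", baseline), ("offloaded", offloaded)):
    util = stage_utilization(times)
    mean_util = sum(util) / len(util)
    print(f"{name}: mean utilization {mean_util:.2f}, "
          f"{microbatches_per_second(times):.1f} microbatches/s")
```

In this toy setup the first three stages idle for part of every interval in the baseline, because they wait on the slower last stage. Balancing the stages removes that idle time, which is the kind of bubble the abstract calls load imbalance.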
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy
