PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism
Z. Jonny Kong, Qiang Xu, Y. Charlie Hu

TL;DR
PPipe is a novel system that leverages pool-based pipeline parallelism to efficiently serve video analytics workloads on heterogeneous GPU clusters, significantly improving utilization and throughput.
Contribution
The paper introduces PPipe, a new inference serving system that exploits pipeline parallelism and resource-aware batching for heterogeneous GPU clusters.
Findings
Achieves 41.1%-65.5% higher utilization of low-class GPUs.
Attains 32.2%-75.1% higher serving throughput.
Effectively balances workload across diverse GPU architectures.
Abstract
With the rapid innovation of GPUs, heterogeneous GPU clusters in both public clouds and on-premise data centers have become increasingly commonplace. In this paper, we demonstrate how pipeline parallelism, a technique well-studied for throughput-oriented deep learning model training, can be used effectively for serving latency-bound model inference, e.g., in video analytics systems, on heterogeneous GPU clusters. Our work exploits the synergy between diversity in model layers and diversity in GPU architectures, which results in comparable inference latency for many layers when running on low-class and high-class GPUs. We explore how such overlooked capability of low-class GPUs can be exploited using pipeline parallelism and present a novel inference serving system, PPipe, that employs pool-based pipeline parallelism via an MILP-based control plane and a data plane that performs resource…
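To make the core idea in the abstract concrete, the toy sketch below picks a layer split between a pool of low-class GPUs and a pool of high-class GPUs so that steady-state throughput is maximized while end-to-end latency stays within a bound. It is an illustration under stated assumptions, not PPipe's actual planner: the paper describes an MILP-based control plane, whereas this sketch uses a brute-force search over a single split point, ignores batching and inter-GPU transfer cost, and uses made-up per-layer latencies. The function name `best_split` and all of its parameters are hypothetical.

```python
from typing import List, Optional, Tuple


def best_split(
    low_ms: List[float],   # assumed per-layer latency on a low-class GPU (ms)
    high_ms: List[float],  # assumed per-layer latency on a high-class GPU (ms)
    n_low: int,            # number of GPUs in the low-class pool
    n_high: int,           # number of GPUs in the high-class pool
    slo_ms: float,         # end-to-end latency bound (ms)
) -> Optional[Tuple[int, float]]:
    """Return (split, requests_per_second) or None if no split meets the bound.

    Layers [0, split) run on the low-class pool and layers [split, n) run on
    the high-class pool. Brute-force stand-in for an MILP formulation.
    """
    n = len(low_ms)
    best: Optional[Tuple[int, float]] = None
    for split in range(n + 1):
        t_low = sum(low_ms[:split])     # time a request spends in stage 1
        t_high = sum(high_ms[split:])   # time a request spends in stage 2
        if t_low + t_high > slo_ms:
            continue                    # violates the latency bound
        # Each non-empty stage is served by a pool of identical GPUs, so its
        # rate is pool_size / stage_time. The slower stage limits the pipeline.
        rates = []
        if split > 0:
            rates.append(n_low * 1000.0 / t_low)
        if split < n:
            rates.append(n_high * 1000.0 / t_high)
        throughput = min(rates)
        if best is None or throughput > best[1]:
            best = (split, throughput)
    return best


# Hypothetical numbers: the first two layers cost the same on both GPU
# classes, while the last two are much slower on the low-class GPU.
low = [2.0, 2.0, 6.0, 8.0]
high = [2.0, 2.0, 2.0, 2.0]
plan = best_split(low, high, n_low=2, n_high=1, slo_ms=20.0)
print(plan)
```

With these made-up latencies, running the first two layers on the low-class pool and the rest on the high-class GPU gives 250 requests per second, against 125 when the single high-class GPU runs the whole model. This mirrors the abstract's observation that layers with comparable latency on both GPU classes are good candidates for offloading.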
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Cloud Computing and Resource Management
