Queueing Analysis of GPU-Based Inference Servers with Dynamic Batching: A Closed-Form Characterization
Yoshiaki Inoue

TL;DR
This paper models a GPU inference server with dynamic batching as a batch-service queue. It shows that energy efficiency rises with the job arrival rate, so the server is best operated at the highest utilization the latency requirement allows, and it derives a closed-form upper bound on the mean latency.
Contribution
It introduces a batch-service queueing model with batch-size-dependent processing times and derives a simple closed-form upper bound on the mean latency of GPU inference servers.
Findings
Energy efficiency increases monotonically with the job arrival rate.
The derived latency upper bound closely approximates actual server performance.
Real-world measurements align well with the theoretical model.
Abstract
GPU-accelerated computing is a key technology to realize high-speed inference servers using deep neural networks (DNNs). An important characteristic of GPU-based inference is that the computational efficiency, in terms of the processing speed and energy consumption, drastically increases by processing multiple jobs together in a batch. In this paper, we formulate GPU-based inference servers as a batch service queueing model with batch-size dependent processing times. We first show that the energy efficiency of the server monotonically increases with the arrival rate of inference jobs, which suggests that it is energy-efficient to operate the inference server under a utilization level as high as possible within a latency requirement of inference jobs. We then derive a closed-form upper bound for the mean latency, which provides a simple characterization of the latency performance.…
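The batch-service model described in the abstract can be explored with a small simulation. The following is a minimal sketch, not the paper's method: it assumes Poisson arrivals, a server that takes every waiting job as one batch whenever it becomes free, and a batch processing time that is affine in the batch size, `alpha * b + tau0`. The function name `simulate` and the parameters `alpha`, `tau0`, and `b_max` are hypothetical and chosen only for illustration.

```python
import random


def simulate(lam, alpha=1.0, tau0=5.0, b_max=None, n_jobs=200_000, seed=0):
    """Simulate a single-server queue with dynamic batching.

    Assumptions (illustrative, not taken from the paper):
      - jobs arrive as a Poisson process with rate lam;
      - when the server is free and jobs are waiting, it serves all of
        them (at most b_max) as one batch;
      - a batch of size b takes alpha * b + tau0 time units.
    Returns (mean latency per job, mean batch size).
    """
    rng = random.Random(seed)

    # Pre-generate arrival times with exponential inter-arrival gaps.
    arrivals = []
    t = 0.0
    for _ in range(n_jobs):
        t += rng.expovariate(lam)
        arrivals.append(t)

    free_at = 0.0        # time at which the server finishes its current batch
    i = 0                # index of the next job not yet served
    total_latency = 0.0
    n_batches = 0
    while i < n_jobs:
        # A batch starts once the server is free and a job is waiting.
        start = max(free_at, arrivals[i])
        # Collect every job that has arrived by the start time.
        j = i
        while (j < n_jobs and arrivals[j] <= start
               and (b_max is None or j - i < b_max)):
            j += 1
        b = j - i
        done = start + alpha * b + tau0
        # Latency of a job is its waiting time plus the batch service time.
        total_latency += sum(done - arrivals[k] for k in range(i, j))
        free_at = done
        i = j
        n_batches += 1

    return total_latency / n_jobs, n_jobs / n_batches


if __name__ == "__main__":
    # With alpha = 1 and unlimited batch size, the queue is stable for lam < 1.
    for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
        latency, batch = simulate(lam)
        print(f"lam={lam:.1f}  mean latency={latency:.2f}  mean batch={batch:.2f}")
```

Under these assumptions the mean batch size grows with the arrival rate, so the fixed cost `tau0` is shared by more jobs per batch. If the energy per batch is likewise affine in the batch size (again an assumption), this is the mechanism by which a higher arrival rate lowers the energy spent per job, at the price of a higher mean latency.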
