Queueing Analysis of GPU-Based Inference Servers with Dynamic Batching: A Closed-Form Characterization

Yoshiaki Inoue

arXiv:1912.06322·cs.PF·January 13, 2021

TL;DR

This paper models a GPU inference server with dynamic batching as a batch-service queue. It shows that energy efficiency increases monotonically with the job arrival rate, so the server is best run at the highest utilization the latency requirement allows, and it derives a closed-form upper bound on the mean latency.

Contribution

The paper formulates GPU-based inference servers as a batch-service queueing model with batch-size-dependent processing times and derives a simple closed-form upper bound on the mean latency.

Findings

01
Energy efficiency increases monotonically with the arrival rate of inference jobs.

02
The exact mean latency is well approximated by the derived closed-form upper bound.

03
Latency measured on real GPU-based inference servers agrees well with the theoretical model.

Abstract

GPU-accelerated computing is a key technology to realize high-speed inference servers using deep neural networks (DNNs). An important characteristic of GPU-based inference is that the computational efficiency, in terms of the processing speed and energy consumption, drastically increases by processing multiple jobs together in a batch. In this paper, we formulate GPU-based inference servers as a batch service queueing model with batch-size dependent processing times. We first show that the energy efficiency of the server monotonically increases with the arrival rate of inference jobs, which suggests that it is energy-efficient to operate the inference server under a utilization level as high as possible within a latency requirement of inference jobs. We then derive a closed-form upper bound for the mean latency, which provides a simple characterization of the latency performance.…
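
The batch-service model described in the abstract can be explored with a small simulation. The sketch below is illustrative only and is not the paper's analysis: it assumes Poisson arrivals, a single server that takes every waiting job into the next batch (no maximum batch size), and a processing time that is linear in the batch size, `alpha * b + tau0`. The function name and the parameters `lam`, `alpha`, and `tau0` are hypothetical names chosen for this example.

```python
import random


def simulate_batch_queue(lam, alpha, tau0, n_jobs=20000, seed=0):
    """Simulate a single-server queue with dynamic batching.

    Assumptions (not taken from the abstract): Poisson arrivals at rate
    `lam`, every waiting job joins the next batch, and a batch of size b
    takes alpha * b + tau0 time units to process.
    Returns (mean latency, mean batch size).
    """
    rng = random.Random(seed)

    # Generate Poisson arrival times.
    arrivals = []
    t = 0.0
    for _ in range(n_jobs):
        t += rng.expovariate(lam)
        arrivals.append(t)

    latencies = []
    batch_sizes = []
    clock = 0.0  # time at which the server next becomes free
    i = 0        # index of the first job not yet served
    while i < n_jobs:
        # If the queue is empty when the server frees up, wait for a job.
        start = max(clock, arrivals[i])
        # The batch is every job that has arrived by the start time.
        j = i
        while j < n_jobs and arrivals[j] <= start:
            j += 1
        b = j - i
        finish = start + alpha * b + tau0
        # Latency is waiting time plus the batch processing time.
        latencies.extend(finish - arrivals[k] for k in range(i, j))
        batch_sizes.append(b)
        clock = finish
        i = j

    mean_latency = sum(latencies) / len(latencies)
    mean_batch = sum(batch_sizes) / len(batch_sizes)
    return mean_latency, mean_batch


if __name__ == "__main__":
    for lam in (0.5, 2.0, 5.0, 8.0):
        latency, batch = simulate_batch_queue(lam, alpha=0.1, tau0=1.0)
        print(f"lam={lam:4.1f}  mean latency={latency:6.3f}  mean batch={batch:6.2f}")
```

Under these assumptions the mean batch size grows with the arrival rate, so the fixed per-batch cost `tau0` is shared by more jobs. That is the intuition behind running the server at the highest utilization the latency requirement permits. With unlimited batch sizes, this sketch is stable only when `lam * alpha < 1`.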
