SMDP-Based Dynamic Batching for Efficient Inference on GPU-Based Platforms

Yaodan Xu; Jingzhou Sun; Sheng Zhou; Zhisheng Niu

arXiv:2301.12865 · cs.LG · September 4, 2023

Open Access

TL;DR

This paper introduces a semi-Markov decision process-based dynamic batching policy for GPU inference that optimally balances response time and power consumption, improving efficiency and adaptability over existing methods.

Contribution

It formulates the batching problem as an SMDP, proposes an efficient solution with reduced complexity, and demonstrates superior performance and flexibility in balancing latency and power.

Findings

01. Optimal policies have a control limit structure.

02. SMDP-based policies outperform benchmarks across traffic conditions.

03. Proposed method reduces computational complexity significantly.

Abstract

In up-to-date machine learning (ML) applications on cloud or edge computing platforms, batching is an important technique for providing efficient and economical services at scale. In particular, parallel computing resources on the platforms, such as graphics processing units (GPUs), have higher computational and energy efficiency with larger batch sizes. However, larger batch sizes may also result in longer response time, and thus it requires a judicious design. This paper aims to provide a dynamic batching policy that strikes a balance between efficiency and latency. The GPU-based inference service is modeled as a batch service queue with batch-size dependent processing time. Then, the design of dynamic batching is a continuous-time average-cost problem, and is formulated as a semi-Markov decision process (SMDP) with the objective of minimizing the weighted sum of average response time…
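
To make the latency/efficiency trade-off described in the abstract concrete, the following is a minimal simulation sketch of a batch service queue under a control-limit batching rule: the server waits until at least `limit` requests are queued, then serves everything waiting, up to `b_max`. It is not the paper's SMDP solution. The Poisson arrivals, the linear `service_time` and `batch_energy` functions, the weights `w1` and `w2`, and all function and parameter names are assumptions chosen for illustration only.

```python
import random


def simulate_control_limit(lam, limit, b_max, n_requests=20000, seed=0,
                           service_time=lambda b: 1.0 + 0.1 * b,
                           batch_energy=lambda b: 2.0 + 0.5 * b):
    """Simulate a single-server batch queue under a control-limit rule.

    Assumed (hypothetical) model: Poisson arrivals with rate `lam`,
    deterministic batch-size dependent service time and energy per batch.
    Returns (average response time, average power).
    """
    rng = random.Random(seed)

    # Pre-generate Poisson arrival times.
    arrivals = []
    t = 0.0
    for _ in range(n_requests):
        t += rng.expovariate(lam)
        arrivals.append(t)

    free_at = 0.0          # time at which the server next becomes idle
    i = 0                  # index of the first request not yet served
    total_response = 0.0
    total_energy = 0.0

    while i < n_requests:
        # The batch may start once `limit` requests are waiting
        # (or fewer if the trace is about to end) and the server is idle.
        need = min(limit, n_requests - i)
        trigger = arrivals[i + need - 1]
        start = max(free_at, trigger)

        # Serve every request that has arrived by `start`, capped at b_max.
        j = i
        while j < n_requests and j - i < b_max and arrivals[j] <= start:
            j += 1
        b = j - i

        finish = start + service_time(b)
        total_response += sum(finish - arrivals[k] for k in range(i, j))
        total_energy += batch_energy(b)
        free_at = finish
        i = j

    return total_response / n_requests, total_energy / free_at


def weighted_cost(avg_latency, avg_power, w1=1.0, w2=1.0):
    """Weighted sum of average response time and average power (assumed weights)."""
    return w1 * avg_latency + w2 * avg_power


# Sweep the control limit at a fixed (assumed) arrival rate.
for limit in (1, 2, 4, 8):
    latency, power = simulate_control_limit(lam=0.5, limit=limit, b_max=32)
    print(limit, round(latency, 3), round(power, 3),
          round(weighted_cost(latency, power), 3))
```

Under these assumed cost functions, raising the control limit at a light load amortizes the fixed per-batch energy over more requests but makes requests wait longer, which is the tension the weighted objective is meant to balance.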

Taxonomy

Topics: IoT and Edge/Fog Computing · Age of Information Optimization · Cloud Computing and Resource Management