Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem

Cheng Tan; Zhichao Li; Jian Zhang; Yu Cao; Sikai Qi; Zherui Liu; Yibo Zhu; Chuanxiong Guo

arXiv:2109.11067·cs.DC·September 24, 2021

TL;DR

This paper introduces MIG-serving, an algorithm pipeline that partitions NVIDIA A100 GPUs for DNN serving. It reduces the number of GPUs needed while maintaining throughput, and it does so by tackling a new NP-hard scheduling problem.

Contribution

It defines the Reconfigurable Machine Scheduling Problem (RMS) and proposes MIG-serving, a pipeline that combines a heuristic greedy algorithm, a Genetic Algorithm (GA), and Monte Carlo Tree Search (MCTS) to find efficient GPU partitions for DNN serving.

Findings

01

MIG-serving can save up to 40% of GPUs compared to default A100 usage.

02

The solution effectively balances GPU partitioning with throughput requirements.

03

Experimental results validate the efficiency of the proposed algorithms.

Abstract

Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA A100 GPUs that partitions one physical GPU into multiple GPU instances. With MIG, A100 can be the most cost-efficient GPU ever for serving Deep Neural Networks (DNNs). However, discovering the most efficient GPU partitions is challenging. The underlying problem is NP-hard; moreover, it is a new abstract problem, which we define as the Reconfigurable Machine Scheduling Problem (RMS). This paper studies serving DNNs with MIG, a new case of RMS. We further propose a solution, MIG-serving. MIG-serving is an algorithm pipeline that blends a variety of newly designed algorithms and customized classic algorithms, including a heuristic greedy algorithm, Genetic Algorithm (GA), and Monte Carlo Tree Search algorithm (MCTS). We implement MIG-serving on Kubernetes. Our experiments show that compared to using A100 as-is, MIG-serving can…
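
To give a feel for why choosing partitions matters, here is a minimal sketch in Python. It is not the paper's algorithm: it brute-forces the partitions of a single GPU for a single model. It assumes a GPU exposes 7 compute slices and that instances come in sizes 1, 2, 3, 4, or 7 slices; it ignores the hardware placement rules that make some combinations illegal on a real A100. The throughput table and the names `partitions` and `best_partition` are hypothetical, chosen only for illustration.

```python
# Assumption: 7 compute slices per GPU, instance sizes of 1, 2, 3, 4, or 7 slices.
# Real MIG hardware permits only a subset of these combinations; that is ignored here.
SIZES = (1, 2, 3, 4, 7)
TOTAL_SLICES = 7

# Hypothetical requests/second for one model on each instance size.
# Made-up numbers that scale sub-linearly with instance size.
HYPOTHETICAL_THROUGHPUT = {1: 100, 2: 180, 3: 250, 4: 300, 7: 400}


def partitions(total=TOTAL_SLICES, sizes=SIZES):
    """Yield non-increasing tuples of instance sizes that sum exactly to total."""
    def rec(remaining, max_size):
        if remaining == 0:
            yield ()
            return
        for s in sizes:
            if s <= max_size and s <= remaining:
                for rest in rec(remaining - s, s):
                    yield (s,) + rest
    yield from rec(total, max(sizes))


def best_partition(throughput):
    """Return the partition with the highest summed throughput (brute force)."""
    return max(partitions(), key=lambda p: sum(throughput[s] for s in p))


if __name__ == "__main__":
    for p in partitions():
        print(p, sum(HYPOTHETICAL_THROUGHPUT[s] for s in p))
    print("best:", best_partition(HYPOTHETICAL_THROUGHPUT))
```

Under these made-up numbers, seven small instances beat one whole GPU, which is the kind of gap that motivates partitioning. Brute force is only feasible here because the example has one GPU and one model; with many GPUs, many models, and per-model throughput requirements, the search space grows quickly, which is where the heuristic, GA, and MCTS components described in the abstract come in.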
