An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks

Pierrick Pochelu; Serge G. Petiton; Bruno Conche

arXiv:2208.14049·cs.DC·August 31, 2022

Open Access

TL;DR

This paper introduces a novel inference system for serving ensembles of deep neural networks efficiently and flexibly across heterogeneous hardware, optimizing resource allocation and asynchronous processing.

Contribution

It presents a new software layer for ensemble DNN inference that combines a device-allocation procedure with asynchronous multi-process execution, reaching up to a 2.7x speedup over the baseline.

Findings

01 Successfully serves 12 heavy DNNs on 4 GPUs

02 Achieves up to 2.7x speedup over baseline

03 Supports multi-GPU multi-threaded DNN inference

Abstract

Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions, but they are compute- and memory-intensive. Demand is therefore growing to make them serve a heavy workload of requests with the computational resources available. Unlike recent inference servers and inference frameworks, which focus on serving single DNNs, we propose a new software layer that serves ensembles of DNNs flexibly and efficiently. Our inference system incorporates several technical innovations. First, we propose a novel procedure to find a good allocation matrix between devices (CPUs or GPUs) and DNN instances. It successively runs a worst-fit algorithm to allocate DNNs into device memory and a greedy algorithm to optimize allocation settings and speed up the ensemble. Second, we design the inference system around multiple processes that run asynchronously:…
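The worst-fit step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `worst_fit_allocation`, the largest-first ordering, and the memory figures are all assumptions made for illustration, and the subsequent greedy optimization step is not covered. The sketch places each DNN on the device that currently has the most free memory and returns a 0/1 allocation matrix (devices by DNNs).

```python
from typing import List, Optional


def worst_fit_allocation(
    model_mem: List[float], device_mem: List[float]
) -> Optional[List[List[int]]]:
    """Return a devices x models 0/1 matrix, or None if some model cannot fit.

    Hypothetical helper: names and units (e.g. GB) are illustrative only.
    """
    free = list(device_mem)  # remaining memory per device
    matrix = [[0] * len(model_mem) for _ in device_mem]

    # Assumption: place the largest models first (a common packing heuristic).
    order = sorted(range(len(model_mem)), key=lambda m: model_mem[m], reverse=True)

    for m in order:
        # Worst-fit rule: choose the device with the most free memory.
        d = max(range(len(free)), key=lambda i: free[i])
        if free[d] < model_mem[m]:
            return None  # even the emptiest device cannot hold this model
        free[d] -= model_mem[m]
        matrix[d][m] = 1
    return matrix


# Hypothetical example: four models (6, 4, 3, 2 GB) on two 8 GB devices.
print(worst_fit_allocation([6, 4, 3, 2], [8, 8]))
# -> [[1, 0, 0, 1], [0, 1, 1, 0]]
```

Worst-fit tends to spread models across devices rather than filling one device first, which leaves headroom on each device for a later tuning step to adjust per-device settings.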

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · IoT and Edge/Fog Computing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings