An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks
Pierrick Pochelu, Serge G. Petiton, Bruno Conche

TL;DR
This paper introduces an inference system that serves ensembles of deep neural networks efficiently and flexibly across CPUs and GPUs. It optimizes how DNN instances are allocated to devices and runs inference asynchronously.
Contribution
It presents a new software layer for ensemble DNN inference that combines a device-to-DNN allocation procedure with asynchronous multi-process execution, and it is reported to be faster than the baseline.
Findings
Successfully serves 12 heavy DNNs on 4 GPUs
Achieves up to 2.7x speedup over baseline
Supports multi-GPU multi-threaded DNN inference
Abstract
Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions, but they are compute- and memory-intensive. Demand is therefore growing for them to handle a heavy workload of requests with the available computational resources. Unlike recent initiatives on inference servers and inference frameworks, which focus on the prediction of single DNNs, we propose a new software layer to serve ensembles of DNNs with flexibility and efficiency. Our inference system is designed with several technical innovations. First, we propose a novel procedure to find a good allocation matrix between devices (CPUs or GPUs) and DNN instances. It successively runs a worst-fit algorithm to place DNNs into device memory and a greedy algorithm to optimize allocation settings and speed up the ensemble. Second, we design the inference system based on multiple processes to run asynchronously:…
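The worst-fit step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `worst_fit_allocate`, the dictionary-based representation (instead of an allocation matrix), the largest-first ordering, and all memory figures are assumptions made for illustration. The greedy refinement step is left out because the abstract does not describe it in detail.

```python
from typing import Dict, List, Optional


def worst_fit_allocate(
    model_mem: Dict[str, float],
    device_mem: Dict[str, float],
) -> Optional[Dict[str, List[str]]]:
    """Place each model on the device with the most free memory left.

    Returns a device -> models mapping, or None if some model does not fit.
    """
    free = dict(device_mem)
    placement: Dict[str, List[str]] = {d: [] for d in device_mem}
    # Largest models first (the "decreasing" variant; an assumption here).
    for name in sorted(model_mem, key=model_mem.get, reverse=True):
        target = max(free, key=free.get)  # device with the most free memory
        if model_mem[name] > free[target]:
            return None  # even the emptiest device cannot hold this model
        placement[target].append(name)
        free[target] -= model_mem[name]
    return placement


# Hypothetical model and device sizes in GB, purely illustrative.
models = {"m1": 6.0, "m2": 5.0, "m3": 4.0, "m4": 3.0}
devices = {"gpu0": 10.0, "gpu1": 10.0}
allocation = worst_fit_allocate(models, devices)
```

Worst-fit tends to spread models evenly across devices, which leaves headroom on each device for a later step to adjust settings; whether that matches the paper's exact motivation is not stated in the abstract.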
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · IoT and Edge/Fog Computing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
