Efficient Inferencing of Compressed Deep Neural Networks

Dharma Teja Vooturi; Saurabh Goyal; Anamitra R. Choudhury; Yogish Sabharwal; Ashish Verma

arXiv:1711.00244 · cs.DC · November 2, 2017 · 2 cites

TL;DR

This paper introduces parallel algorithms for efficient inference on compressed deep neural networks, especially Huffman-encoded models. Using variable batch sizes, it improves AlexNet inference throughput by 15-25% while staying within memory and latency constraints.

Contribution

It presents parallel inference algorithms tailored to compressed models, covering single-image and batched inference on Huffman-encoded weights and using variable batch sizes to raise throughput.

Findings

01. Achieves a 15-25% improvement in inference throughput on AlexNet.

02. Maintains memory and latency constraints during inference.

03. Provides algorithms applicable to low-memory environments.

Abstract

The large number of weights in deep neural networks makes the models difficult to deploy in low-memory environments such as mobile phones and IoT edge devices, as well as in "inferencing as a service" environments in the cloud. Prior work has considered reducing the size of the models through compression techniques such as pruning, quantization, and Huffman encoding. However, efficient inferencing using the compressed models has received little attention, especially with Huffman encoding in place. In this paper, we propose efficient parallel algorithms for inferencing of single images and batches under various memory constraints. Our experimental results show that our approach of using variable batch sizes for inferencing achieves a 15-25% improvement in inference throughput for AlexNet, while maintaining memory and latency constraints.
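
To illustrate why Huffman encoding complicates inference, the sketch below encodes a short list of quantized weight indices with a Huffman code and decodes it again. It is a generic, minimal illustration and not the paper's algorithm: the function names (`build_huffman_code`, `encode`, `decode`) and the sample indices are assumptions made for this example. Because codewords have variable length, the position of the k-th weight in the bit stream is not known without decoding everything before it, which is what makes parallel inference on such models non-trivial.

```python
import heapq
from collections import Counter


def build_huffman_code(symbols):
    """Return a {symbol: bitstring} prefix code for the given symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries are (frequency, unique tie-breaker, partial code table).
    # The unique tie-breaker guarantees the dicts are never compared.
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the codewords of all symbols into one bit string."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Decode sequentially; codeword boundaries are only known as we go."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical quantized weight indices (4 clusters, skewed frequencies).
indices = [0, 0, 0, 0, 1, 1, 2, 3]
code = build_huffman_code(indices)
bits = encode(indices, code)
restored = decode(bits, code)
print(len(bits), "bits vs", 2 * len(indices), "bits with fixed 2-bit codes")
```

The sequential loop in `decode` is the point of the example: a fixed-width encoding would allow any weight to be read directly, whereas the Huffman stream has to be walked from the start (or from separately stored block offsets) before the weights can be used in a layer computation.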

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis

Methods: 1x1 Convolution · Convolution · Local Response Normalization · Grouped Convolution · Dropout · Dense Connections · Max Pooling · Softmax