Efficient Inferencing of Compressed Deep Neural Networks
Dharma Teja Vooturi, Saurabh Goyal, Anamitra R. Choudhury, Yogish Sabharwal, Ashish Verma

TL;DR
This paper introduces parallel algorithms for inference on compressed deep neural networks, particularly Huffman-encoded models, and reports a 15-25% gain in inference throughput for AlexNet while meeting memory and latency constraints.
Contribution
It proposes parallel algorithms for single-image and batched inference on compressed models with Huffman encoding in place, and uses variable batch sizes to raise inference throughput.
Findings
Variable batch sizes improve inference throughput on AlexNet by 15-25%.
Memory and latency constraints are still met during inference.
Provides algorithms applicable to low-memory environments.
Abstract
The large number of weights in deep neural networks makes the models difficult to deploy in low-memory environments such as mobile phones, IoT edge devices, and "inferencing as a service" environments in the cloud. Prior work has considered reducing the size of the models through compression techniques such as pruning, quantization, and Huffman encoding. However, efficient inferencing using the compressed models has received little attention, especially with Huffman encoding in place. In this paper, we propose efficient parallel algorithms for inferencing of single images and batches under various memory constraints. Our experimental results show that our approach of using variable batch sizes for inferencing achieves a 15-25% improvement in inference throughput for AlexNet while maintaining memory and latency constraints.
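To make the compression setting concrete, here is a minimal Python sketch of Huffman-coding the quantized weight indices of one layer and decoding them back before use. It is not the paper's algorithm: the decoder is purely sequential, whereas the paper proposes parallel algorithms that this page does not describe. The names `build_huffman_codes`, `encode`, `decode`, and the toy `codebook` and `indices` values are all assumptions made for illustration.

```python
import heapq
from collections import Counter


def build_huffman_codes(symbols):
    """Return a dict mapping each symbol to its Huffman bit string."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {symbol: partial code}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, codes):
    """Concatenate the code of every symbol into one bit string."""
    return "".join(codes[s] for s in symbols)


def decode(bits, codes):
    """Sequentially decode a bit string; valid because codes are prefix-free."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical quantized layer: each weight is an index into a small codebook.
codebook = [-0.5, 0.0, 0.25, 0.75]
indices = [1, 1, 1, 2, 1, 0, 1, 3, 1, 2, 1, 1]

codes = build_huffman_codes(indices)
bits = encode(indices, codes)
decoded = decode(bits, codes)
weights = [codebook[i] for i in decoded]  # weights recovered for inference

fixed_width_bits = 2 * len(indices)  # 2 bits per index for 4 codebook entries
print(len(bits), "Huffman bits vs", fixed_width_bits, "fixed-width bits")
```

Because each code has a different length, the start of a given weight in the bit stream is not known until all earlier weights have been decoded. That dependency is one reason inference directly on Huffman-encoded models is harder than on fixed-width quantized models.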
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
Methods: 1x1 Convolution · Convolution · Local Response Normalization · Grouped Convolution · Dropout · Dense Connections · Max Pooling · Softmax
