Highly Efficient 8-bit Low Precision Inference of Convolutional Neural Networks with IntelCaffe
Jiong Gong, Haihao Shen, Guoming Zhang, Xiaoli Liu, Shane Li, Ge Jin, Niharika Maheshwari, Evarist Fomenko, Eden Segal

TL;DR
This paper introduces IntelCaffe, an optimized deep learning framework that enables efficient 8-bit low precision inference on Intel Xeon processors, significantly improving throughput and latency with minimal accuracy loss.
Contribution
It presents the first Intel-optimized framework supporting automatic 8-bit model inference without retraining, boosting performance of CNNs on Intel hardware.
Findings
Inference throughput improved by up to 2.9X
Latency reduced by up to 3X
Minimal accuracy loss compared to FP32 baseline
Abstract
High-throughput, low-latency inference of deep neural networks is critical for the deployment of deep learning applications. This paper presents the efficient inference techniques of IntelCaffe, the first Intel-optimized deep learning framework that supports efficient 8-bit low precision inference and model optimization techniques of convolutional neural networks on Intel Xeon Scalable Processors. The 8-bit optimized model is automatically generated with a calibration process from an FP32 model without the need for fine-tuning or retraining. We show that the inference throughput and latency with ResNet-50, Inception-v3 and SSD are improved by 1.38X-2.9X and 1.35X-3X respectively, with negligible accuracy loss, from the IntelCaffe FP32 baseline and by 56X-75X and 26X-37X from BVLC Caffe. All these techniques have been open-sourced on IntelCaffe GitHub, and the artifact is provided to…
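The abstract says the 8-bit model is produced from the FP32 model by a calibration pass rather than by retraining. The sketch below illustrates that general idea with a simple symmetric max-abs rule. This is an assumption made for illustration: the excerpt does not describe IntelCaffe's actual calibration algorithm, and the function names (`calibrate_scale`, `quantize`, `dequantize`) and sample values are hypothetical.

```python
# Hypothetical sketch of post-training 8-bit calibration (symmetric max-abs).
# This is NOT IntelCaffe's implementation; names and data are illustrative.

def calibrate_scale(samples, num_bits=8):
    """Derive a scale factor from FP32 calibration batches (max-abs rule)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for signed 8-bit
    max_abs = max(abs(x) for batch in samples for x in batch)
    if max_abs == 0.0:
        return 1.0  # avoid division by zero for all-zero tensors
    return qmax / max_abs


def quantize(values, scale, num_bits=8):
    """Map FP32 values to signed integers, clamping to the 8-bit range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v * scale))) for v in values]


def dequantize(qvalues, scale):
    """Map quantized integers back to approximate FP32 values."""
    return [q / scale for q in qvalues]


# Calibration data: a few FP32 activation batches observed on sample inputs.
calibration_batches = [[0.5, -1.0, 0.25], [2.0, -0.75, 1.5]]
scale = calibrate_scale(calibration_batches)

# At inference time, values outside the calibrated range saturate.
activations = [1.0, -2.0, 0.1, 3.0]
q = quantize(activations, scale)
recovered = dequantize(q, scale)
```

No weights are updated anywhere in this flow: only a scale factor is estimated from sample data, which is why such a scheme needs no fine-tuning. The trade-off is that values outside the calibrated range, such as `3.0` here, are clipped.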
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Neural Networks and Applications
Methods: Convolution · Non Maximum Suppression · 1x1 Convolution · SSD
