Hardware-Centric AutoML for Mixed-Precision Quantization
Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han

TL;DR
This paper presents a hardware-aware AutoML framework that uses reinforcement learning to optimize mixed-precision quantization policies for neural networks, tailoring each policy to a specific hardware architecture to improve efficiency.
Contribution
Introduces HAQ, a reinforcement learning-based framework that automatically finds optimal mixed-precision quantization policies considering hardware feedback, unlike traditional uniform approaches.
Findings
Reduced latency by up to 1.95x relative to fixed 8-bit quantization
Lowered energy consumption by 1.9x relative to fixed 8-bit quantization
Customized quantization policies for different hardware architectures
Abstract
Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emerging DNN hardware accelerators have begun to support mixed precision (1-8 bits) to further improve computation efficiency, which makes finding the optimal bitwidth for each layer a great challenge: it requires domain experts to explore a vast design space, trading off accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. Conventional quantization algorithms ignore the differences between hardware architectures and quantize all layers uniformly. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy and takes the hardware accelerator's feedback into the design loop. Rather than relying on proxy signals such…
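To make the per-layer bitwidth idea concrete, here is a minimal sketch of applying a mixed-precision policy to a toy model and comparing its weight storage with a uniform 8-bit baseline. It is an illustration under stated assumptions, not the HAQ implementation: the function names (`quantize_symmetric`, `model_size_bits`), the layer names and sizes, and the hand-picked `policy` are all hypothetical, and the quantizer is a generic symmetric linear one that needs at least 2 bits. The reinforcement learning search and the hardware feedback described in the abstract are not modeled here.

```python
import random

def quantize_symmetric(weights, bits):
    """Quantize a flat list of floats onto a symmetric linear grid with `bits` bits."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits (one is the sign)")
    levels = 2 ** (bits - 1) - 1            # largest positive integer code
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / levels                # spacing between adjacent grid points
    # Round each weight to the nearest grid point and clamp to the valid code range.
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]

def model_size_bits(layers, policy):
    """Total weight storage in bits under a per-layer bitwidth policy."""
    return sum(len(layers[name]) * policy[name] for name in layers)

# Hypothetical toy model: each layer is just a flat list of random weights.
random.seed(0)
layers = {
    "conv1": [random.gauss(0.0, 1.0) for _ in range(432)],
    "conv2": [random.gauss(0.0, 1.0) for _ in range(4608)],
    "fc": [random.gauss(0.0, 1.0) for _ in range(1280)],
}

# Hand-picked mixed-precision policy (assumption) versus a uniform 8-bit baseline.
policy = {"conv1": 8, "conv2": 4, "fc": 6}
uniform = {name: 8 for name in layers}

quantized = {name: quantize_symmetric(w, policy[name]) for name, w in layers.items()}
mixed_bits = model_size_bits(layers, policy)
uniform_bits = model_size_bits(layers, uniform)
print(f"mixed: {mixed_bits} bits, uniform 8-bit: {uniform_bits} bits")
```

In this toy setup the policy is fixed by hand. In the framework the abstract describes, the bitwidth for each layer would instead be chosen automatically, with the target hardware's feedback in the loop.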
Taxonomy
Topics: Advanced Control Systems Optimization · Fault Detection and Control Systems · Image Processing Techniques and Applications
