FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search
Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi, Quoc V. Le, Sheng Li

TL;DR
This paper introduces FLIQS, a one-shot mixed-precision quantization search method that finds optimal quantization configurations for neural networks without retraining, improving accuracy and efficiency over prior methods.
Contribution
The paper presents the first one-shot mixed-precision quantization search that eliminates retraining, applicable to both integer and floating-point models, and extends to joint architecture and quantization search.
Findings
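Integer search: improves ResNet-18 accuracy on ImageNet by 1.31 percentage points over previous methods at equivalent model cost.
Floating-point search: improves MobileNetV2 by up to 0.98 percentage points over prior FP8 models.
Joint quantization and architecture search: improves ImageNet accuracy by 2.69 percentage points at similar model cost.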
Abstract
Quantization has become a mainstream compression technique for reducing model size, computational requirements, and energy consumption for modern deep neural networks (DNNs). With improved numerical support in recent hardware, including multiple variants of integer and floating point, mixed-precision quantization has become necessary to achieve high-quality results with low model cost. Prior mixed-precision methods have performed either a post-training quantization search, which compromises on accuracy, or a differentiable quantization search, which leads to high memory usage from branching. Therefore, we propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models. We evaluate our search (FLIQS) on multiple convolutional and vision transformer networks to discover Pareto-optimal models.…
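To make the abstract's terms concrete, the following is a minimal sketch of what quantizing a value to an integer grid versus a low-precision floating-point grid looks like, and how a per-layer format choice affects model size. It is not the FLIQS search itself. The function names, the IEEE-style toy float format (top exponent code reserved, so the E4M3 maximum here differs from some hardware FP8 variants), and the size-in-bits cost are all assumptions made for illustration.

```python
import math

def quantize_int(x, bits, max_abs):
    """Symmetric uniform integer quantization of a scalar (simulated in float)."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4 bits, 127 for 8 bits
    scale = max_abs / qmax                  # real-valued width of one integer step
    q = max(-qmax, min(qmax, round(x / scale)))  # round, then clip to the range
    return q * scale

def quantize_float(x, exp_bits, man_bits):
    """Round a scalar to a toy float grid with 1 sign bit, exp_bits exponent
    bits, and man_bits mantissa bits. Assumes an IEEE-style layout and
    saturates on overflow instead of producing inf."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    _, e = math.frexp(abs(x))               # abs(x) = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, 1 - bias)              # exponent in 1.f form, clamped at min normal
    step = 2.0 ** (exp - man_bits)          # grid spacing at this exponent
    q = round(abs(x) / step) * step         # round to the nearest grid point
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    return math.copysign(min(q, max_val), x)

def model_size_bits(layer_params, layer_bits):
    """Hypothetical cost metric: total weight storage in bits."""
    return sum(n * b for n, b in zip(layer_params, layer_bits))

# The same value lands on different grid points depending on the format.
x = 0.3
as_int4 = quantize_int(x, bits=4, max_abs=1.0)         # uniform spacing of 1/7
as_e5m2 = quantize_float(x, exp_bits=5, man_bits=2)    # spacing grows with magnitude

# A mixed-precision configuration assigns one bit-width per layer.
params = [1000, 4000, 2000]                 # hypothetical per-layer weight counts
uniform = model_size_bits(params, [8, 8, 8])
mixed = model_size_bits(params, [8, 4, 4])
```

A mixed-precision search chooses the per-layer formats (here the `[8, 4, 4]` list) to trade accuracy against a cost like `model_size_bits`. The integer grid is evenly spaced, while the floating-point grid is denser near zero, which is one reason the best format can differ from layer to layer.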
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Pointwise Convolution · Depthwise Convolution · Depthwise Separable Convolution · Batch Normalization · Softmax · Inverted Residual Block · Linear Layer · Dense Connections
