MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search

Eliska Kloberdanz; Wei Le

arXiv:2309.17341·cs.LG·October 2, 2023

TL;DR

MixQuant is a novel search algorithm that optimizes layer-wise weight bit-widths in neural network quantization, improving accuracy by minimizing roundoff errors and enhancing existing quantization methods.

Contribution

It introduces a layer-wise bit-width optimization approach for quantization, compatible with any quantization method, to improve model accuracy and efficiency.

Findings

1. MixQuant improves accuracy when combined with BRECQ.

2. MixQuant enhances performance of vanilla asymmetric quantization.

3. Layer-wise bit-width optimization reduces roundoff errors.
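
To make the terms in these findings concrete, the sketch below shows uniform asymmetric (affine) quantization of a weight tensor, measures its roundoff error as mean squared error, and picks a per-layer bit-width from a candidate list. It is only an illustration: the selection rule (smallest bit-width whose error stays below a fraction `rel_tol` of the weight variance), the function names, the candidate list, and the synthetic "layers" are all assumptions made here, not the search criterion defined in the paper.

```python
import numpy as np


def quantize_asymmetric(w, bits):
    """Quantize w to `bits` with a uniform asymmetric grid, then dequantize."""
    qmin, qmax = 0, 2 ** bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / (qmax - qmin)
    if scale == 0.0:
        # Constant tensor: nothing to round.
        return w.copy()
    zero_point = np.round(qmin - w_min / scale)
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale


def roundoff_error(w, bits):
    """Mean squared difference between w and its quantized version."""
    return float(np.mean((w - quantize_asymmetric(w, bits)) ** 2))


def pick_bit_width(w, candidates=(2, 3, 4, 5, 6, 7, 8), rel_tol=0.01):
    """Hypothetical rule: smallest bit-width with error <= rel_tol * var(w).

    Falls back to the largest candidate if none meets the budget.
    """
    budget = rel_tol * float(np.var(w))
    for bits in sorted(candidates):
        if roundoff_error(w, bits) <= budget:
            return bits
    return max(candidates)


# Two synthetic "layers": one well-behaved, one with a few large outliers.
rng = np.random.default_rng(0)
layer_plain = rng.standard_normal(4096)
layer_outliers = np.concatenate([layer_plain, [20.0, -20.0, 20.0, -20.0]])

chosen = {
    "plain": pick_bit_width(layer_plain),
    "outliers": pick_bit_width(layer_outliers),
}
print(chosen)
```

Under this assumed rule, a layer whose weights span a wide range relative to their spread needs more bits to meet the same error budget, which is the intuition behind assigning different bit-widths to different layers.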

Abstract

Quantization is a technique for creating efficient Deep Neural Networks (DNNs), which involves performing computations and storing tensors at lower bit-widths than 32-bit floating-point (f32) precision. Quantization reduces model size and inference latency, and therefore allows DNNs to be deployed on platforms with constrained computational resources and in real-time systems. However, quantization can lead to numerical instability caused by roundoff error, which leads to inaccurate computations and therefore to a decrease in quantized model accuracy. Similar to prior work, which has shown that both biases and activations are more sensitive to quantization and are best kept in full precision or quantized with higher bit-widths, we show that some weights are more sensitive than others, which should be reflected in their quantization bit-width. To that end, we propose MixQuant, a search algorithm…

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Neural Networks and Applications