QFT: Post-training quantization via fast joint finetuning of all degrees of freedom
Alex Finkelstein, Ella Fuchs, Idan Tal, Mark Grobman, Niv Vosco, Eldad Meller

TL;DR
This paper introduces QFT, a unified post-training quantization method that jointly finetunes all quantization degrees of freedom, achieving 4-bit weight quantization results on par with the state of the art within the speed and resource constraints of PTQ.
Contribution
It proposes a hardware-aware parameterization of the quantized network that allows all quantization degrees of freedom (DoFs) to be finetuned jointly, end to end, in a single step.
Findings
QFT achieves 4-bit weight quantization accuracy on par with state-of-the-art PTQ methods.
The method is simple, extendable, and fast, suitable for practical deployment.
Joint finetuning of all DoFs improves quantization accuracy without multi-step procedures.
Abstract
The post-training quantization (PTQ) challenge of bringing quantized neural net accuracy close to that of the original has drawn much attention, driven by industry demand. Many of the methods emphasize optimization of a specific degree of freedom (DoF), such as quantization step size, preconditioning factors, or bias fixing, often chained to others in multi-step solutions. Here we rethink quantized network parameterization in a HW-aware fashion, towards a unified analysis of all quantization DoFs, permitting, for the first time, their joint end-to-end finetuning. Our simple, extendable, single-step method, dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization results on par with SoTA within PTQ constraints of speed and resource.
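To make the three DoFs named in the abstract concrete, here is a minimal NumPy sketch of a single linear layer with 4-bit fake-quantized weights. It is an illustration under stated assumptions, not the paper's parameterization: the function names (`fake_quantize`, `quantized_linear`), the per-tensor symmetric grid, and the mean-residual bias fix are all hypothetical choices made for this example. The sketch performs no finetuning; it only shows where step size, preconditioning factors, and bias enter the computation, which are the quantities a joint optimizer would update together.

```python
import numpy as np

def fake_quantize(w, step, bits=4):
    """Symmetric uniform fake quantization: snap to an integer grid, clip, rescale."""
    qmax = 2 ** (bits - 1) - 1      # 7 for signed 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for signed 4-bit
    q = np.clip(np.round(w / step), qmin, qmax)
    return q * step

def quantized_linear(x, w, bias, step, precond):
    """Linear layer exposing three DoFs: step size, preconditioning factors, bias.

    precond holds one positive factor per input channel. Scaling the weight rows
    by it and the inputs by its inverse is an identity in float arithmetic, but
    it changes which weight values land on the quantization grid.
    """
    w_scaled = w * precond[:, None]
    x_scaled = x / precond[None, :]
    return x_scaled @ fake_quantize(w_scaled, step) + bias

# Toy data (assumed shapes: 256 samples, 16 input channels, 8 output channels).
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16))
w = rng.normal(size=(16, 8))
y_float = x @ w

# Naive starting point for each DoF.
precond = np.ones(16)               # no preconditioning
step = np.abs(w).max() / 7          # max-based per-tensor step size
bias = np.zeros(8)

y_q = quantized_linear(x, w, bias, step, precond)
err_before = np.mean((y_q - y_float) ** 2)

# A simple bias fix: absorb the mean output error of each channel into the bias.
bias_fix = (y_float - y_q).mean(axis=0)
y_fixed = quantized_linear(x, w, bias_fix, step, precond)
err_after = np.mean((y_fixed - y_float) ** 2)

print(f"MSE before bias fix: {err_before:.6f}")
print(f"MSE after bias fix:  {err_after:.6f}")
```

In a multi-step pipeline, each of `step`, `precond`, and `bias` would be tuned by a separate procedure in sequence. The abstract's point is that QFT instead treats all such DoFs as parameters of one end-to-end finetuning problem.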
Taxonomy
Topics: Advanced Neural Network Applications · Model Reduction and Neural Networks · Neural Networks and Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
