QFT: Post-training quantization via fast joint finetuning of all degrees of freedom

Alex Finkelstein, Ella Fuchs, Idan Tal, Mark Grobman, Niv Vosco, Eldad Meller

arXiv:2212.02634·stat.ML·March 21, 2023

Open Access

TL;DR

This paper introduces QFT, a unified post-training quantization method that jointly finetunes all quantization degrees of freedom, achieving state-of-the-art 4-bit weight quantization results efficiently.

Contribution

It proposes a novel HW-aware joint finetuning approach for all quantization DoFs, enabling effective end-to-end optimization in a single step.

Findings

1. QFT achieves 4-bit weight quantization results comparable to state-of-the-art methods.

2. The method is simple, extendable, and fast, making it suitable for practical deployment.

3. Joint finetuning of all DoFs improves quantized accuracy without requiring multi-step procedures.

Abstract

The post-training quantization (PTQ) challenge of bringing quantized neural network accuracy close to that of the original model has drawn much attention, driven by industry demand. Many methods emphasize optimization of a specific degree of freedom (DoF), such as quantization step size, preconditioning factors, or bias fixing, and are often chained to others in multi-step solutions. Here we rethink quantized network parameterization in a hardware-aware (HW-aware) fashion, towards a unified analysis of all quantization DoF, permitting for the first time their joint end-to-end finetuning. Our simple, extendable, single-step method, dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization results on par with the state of the art (SoTA) within the PTQ constraints of speed and resources.
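To make the idea of jointly finetuning several quantization degrees of freedom more concrete, the following is a minimal NumPy sketch and not the paper's implementation. It assumes a single linear layer, a symmetric signed 4-bit weight grid, a scalar step size, a straight-through gradient estimator for rounding, and a full-precision copy of the layer as the training target. The names `fake_quant` and `joint_finetune` and all hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def fake_quant(w, step, bits=4):
    """Quantize-dequantize w onto a symmetric signed `bits`-bit grid.

    Returns the dequantized weights and the integer codes."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4 bits
    q = np.clip(np.round(w / step), -qmax - 1, qmax)
    return q * step, q

def joint_finetune(w_fp, x, bits=4, lr=1e-3, iters=300):
    """Jointly update weights, step size, and bias (illustrative assumption).

    The target is the full-precision output x @ w_fp; rounding is handled
    with a straight-through estimator. Returns the per-iteration losses."""
    qmax = 2 ** (bits - 1) - 1
    y_ref = x @ w_fp                            # full-precision reference outputs
    w = w_fp.copy()                             # DoF 1: latent weights
    step = float(np.abs(w_fp).max()) / qmax     # DoF 2: scalar step size
    bias = np.zeros(w_fp.shape[1])              # DoF 3: output bias correction
    losses = []
    for _ in range(iters):
        w_q, q = fake_quant(w, step, bits)
        err = x @ w_q + bias - y_ref
        losses.append(float(np.mean(err ** 2)))

        g_out = 2.0 * err / err.size            # dLoss / d(output)
        g_wq = x.T @ g_out                      # dLoss / d(quantized weights)

        ratio = w / step
        inside = (ratio >= -qmax - 1) & (ratio <= qmax)
        g_w = g_wq * inside                     # straight-through inside the clip range
        # Step-size gradient: rounding residual inside the range, code value outside.
        g_step = float(np.sum(g_wq * np.where(inside, q - ratio, q)))
        g_bias = g_out.sum(axis=0)

        # All three DoFs are updated in the same optimization step.
        w -= lr * g_w
        step = max(step - lr * g_step, 1e-8)    # keep the step size positive
        bias -= lr * g_bias
    return losses

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(256, 16))              # stand-in calibration batch
    w_fp = rng.normal(size=(16, 8))
    losses = joint_finetune(w_fp, x)
    print("initial loss:", losses[0], "best loss:", min(losses))
```

The point of the sketch is only that the step size, the weights, and the bias receive gradients from the same end-to-end loss in one loop, rather than being tuned in separate chained stages.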

Taxonomy

Topics: Advanced Neural Network Applications · Model Reduction and Neural Networks · Neural Networks and Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings