OAC: Output-adaptive Calibration for Accurate Post-training Quantization
Ali Edalati, Alireza Ghaffari, Mahsa Ghazvini Nejad, Lu Hou, Boxing Chen, Masoud Asgharian, Vahid Partovi Nia

TL;DR
This paper introduces Output-adaptive Calibration (OAC), a novel post-training quantization method for large language models that incorporates output information to improve accuracy at low bit-widths.
Contribution
OAC formulates a new output-aware quantization error and approximates an output-adaptive Hessian to enhance low-precision LLM quantization accuracy.
Findings
OAC outperforms state-of-the-art methods like SpQR and BiLLM.
OAC achieves better accuracy at 2-bit and binary quantization.
The method effectively maintains model output quality after quantization.
Abstract
Deployment of Large Language Models (LLMs) has major computational costs, due to their rapidly expanding size. Compression of LLMs reduces the memory footprint, latency, and energy required for their inference. Post-training Quantization (PTQ) techniques have been developed to compress LLMs while avoiding expensive re-training. Most PTQ approaches formulate the quantization error based on a layer-wise Euclidean loss, ignoring the model output. Then, each layer is calibrated using its layer-wise Hessian to update the weights towards minimizing the quantization error. The Hessian is also used for detecting the most salient weights to quantization. Such PTQ approaches are prone to accuracy drop in low-precision quantization. We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process. We formulate the quantization error based on the distortion of…
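To make the layer-wise baseline described in the abstract concrete, here is a minimal sketch in plain Python. It is not OAC itself and not the authors' code. It assumes a simple per-row round-to-nearest quantizer (`quantize_rtn`, a hypothetical helper chosen for illustration) and uses the standard form of the layer-wise Hessian for the Euclidean loss ||WX - W_q X||^2, namely H = 2 X X^T. All names and the toy values of `W` and `X` are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def quantize_rtn(w, bits=2):
    """Assumed quantizer: uniform round-to-nearest over each row's min/max range."""
    levels = 2 ** bits - 1
    out = []
    for row in w:
        lo, hi = min(row), max(row)
        scale = (hi - lo) / levels if hi > lo else 1.0
        out.append([lo + round((v - lo) / scale) * scale for v in row])
    return out


def layerwise_error(w, w_q, x):
    """Layer-wise Euclidean loss: squared Frobenius norm of W X - W_q X."""
    y, y_q = matmul(w, x), matmul(w_q, x)
    return sum((a - b) ** 2 for ra, rb in zip(y, y_q) for a, b in zip(ra, rb))


def layerwise_hessian(x):
    """Hessian of the loss above w.r.t. one weight row: H = 2 X X^T."""
    return [[2.0 * v for v in row] for row in matmul(x, transpose(x))]


def quadratic_form_error(w, w_q, h):
    """Same loss written per row as 0.5 * d H d^T, with d = w_row - w_q_row."""
    total = 0.0
    for rw, rq in zip(w, w_q):
        d = [a - b for a, b in zip(rw, rq)]
        hd = [sum(hij * dj for hij, dj in zip(hrow, d)) for hrow in h]
        total += 0.5 * sum(di * hdi for di, hdi in zip(d, hd))
    return total


# Toy layer: 2 output rows, 3 input features, 4 calibration samples.
W = [[0.9, -0.4, 0.1], [-0.7, 0.3, 0.5]]
X = [[1.0, 0.5, -1.0, 2.0], [0.0, 1.5, 0.5, -0.5], [2.0, -1.0, 1.0, 0.0]]

W_q = quantize_rtn(W, bits=2)
H = layerwise_hessian(X)
err = layerwise_error(W, W_q, X)
err_quadratic = quadratic_form_error(W, W_q, H)
```

The two error values agree up to floating-point rounding, which is why Hessian-based calibration can reason about the layer-wise loss through H alone. Note that H here depends only on the layer inputs `X`; nothing about the model's final output enters it, which is the limitation the abstract attributes to such layer-wise approaches.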
