TL;DR
This paper improves the residual error formulation in compensation-based LLM quantization. It aligns quantized outputs with those of the full-precision model and accounts for compensation-aware error, which leads to better performance.
Contribution
The paper redefines the residual error objective and introduces a compensation-aware error, enhancing existing methods such as GPTQ and GPTAQ for LLM quantization.
Findings
The method yields significant performance improvements across various LLMs and quantization settings.
Redefining the residual error objective improves alignment with full-precision outputs.
Incorporating compensation-aware error enhances quantization accuracy.
Abstract
Methods based on weight compensation, which iteratively apply quantization and weight compensation to minimize the output error, have recently demonstrated remarkable success in quantizing Large Language Models (LLMs). The representative work, GPTQ, introduces several key techniques that make such iterative methods practical for LLMs with billions of parameters. GPTAQ extends this approach by introducing an asymmetric calibration process that aligns the output of each quantized layer with its full-precision counterpart, incorporating a residual error into the weight compensation framework. In this work, we revisit the formulation of the residual error. We identify a sub-optimal calibration objective in existing methods: during the intra-layer calibration process, they align the quantized output with the output from compensated weights, rather than the true output from the original…
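For readers unfamiliar with the compensation-based scheme the abstract builds on, the following is a minimal NumPy sketch of a generic GPTQ-style loop: quantize one input column at a time and spread the resulting error over the columns not yet quantized, using the inverse of the proxy Hessian `X @ X.T`. It does not implement the revised residual error or the compensation-aware error proposed in the paper. The function names, the symmetric 4-bit grid, the single shared `scale`, and the `damp` value are all assumptions made for illustration, and the explicit inverse update is used here for readability in place of the Cholesky-based formulation used in practice.

```python
import numpy as np


def quantize_rtn(w, scale):
    """Round-to-nearest onto a symmetric 4-bit grid (grid choice is an assumption)."""
    return np.clip(np.round(w / scale), -8, 7) * scale


def compensated_quantize(W, X, scale, damp=0.01):
    """Hypothetical helper: column-wise quantization with weight compensation.

    W: (d_out, d_in) layer weights; X: (d_in, n_samples) calibration inputs.
    """
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]

    # Proxy Hessian of the layer-wise squared output error, damped for stability.
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)
    Hinv = np.linalg.inv(H)

    Q = np.zeros_like(W)
    for i in range(d_in):
        # Quantize the current (already compensated) column.
        Q[:, i] = quantize_rtn(W[:, i], scale)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]

        # Compensate: push the quantization error onto the remaining columns.
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])

        # Eliminate index i so Hinv[i+1:, i+1:] is the inverse Hessian
        # of the columns that are still unquantized.
        Hinv[i + 1:, i + 1:] -= (
            np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / Hinv[i, i]
        )
    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 8))
    X = rng.normal(size=(8, 64))
    scale = np.abs(W).max() / 7

    Q_comp = compensated_quantize(W, X, scale)
    Q_rtn = quantize_rtn(W, scale)
    print("output error, compensated:", np.linalg.norm(W @ X - Q_comp @ X))
    print("output error, round-to-nearest:", np.linalg.norm(W @ X - Q_rtn @ X))
```

Note that inside the loop each column is quantized after earlier compensation has already modified it, so the loop works with compensated weights rather than the original ones. This is the intra-layer setting in which the abstract says existing methods use a sub-optimal calibration objective.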
