First-Order Error Matters: Accurate Compensation for Quantized Large Language Models
Xingyu Zheng, Haotong Qin, Yuye Li, Haoran Chu, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu

TL;DR
This paper introduces FOEM, a post-training quantization method for large language models that explicitly incorporates first-order gradient terms into quantization error compensation, improving accuracy over the classical GPTQ method.
Contribution
FOEM is the first quantization approach to explicitly include first-order gradient terms, reducing computational costs while enhancing model performance across various benchmarks.
Findings
FOEM reduces perplexity of Llama3-8B by 17.3% in 3-bit quantization.
FOEM improves 5-shot MMLU accuracy from 53.8% to 56.1%.
FOEM outperforms classical GPTQ and combines well with SpinQuant.
Abstract
Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by performing a first-order Taylor expansion around the pre-quantization weights. This yields an…
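To make the abstract's argument concrete, here is a minimal sketch in Python. It is a toy illustration, not FOEM itself: the quadratic loss, the 2x2 Hessian, and every numeric value are hypothetical. It shows that the gradient vanishes at the full-precision weights, but once the latent weights have drifted away from them, the first-order term of the Taylor expansion is no longer zero and a second-order-only estimate misjudges the loss change.

```python
# Toy illustration only (all values hypothetical, not taken from the paper).
# Loss: L(w) = 0.5 * (w - w_fp)^T H (w - w_fp), minimized at w_fp.

def matvec(H, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def loss(H, w, w_fp):
    """Quadratic loss around the full-precision weights w_fp."""
    d = [a - b for a, b in zip(w, w_fp)]
    return 0.5 * dot(d, matvec(H, d))

H = [[2.0, 0.5], [0.5, 1.0]]   # assumed positive-definite Hessian
w_fp = [0.30, -0.70]           # assumed full-precision weights
drift = [0.05, -0.04]          # assumed deviation left by earlier compensation steps
w_latent = [a + b for a, b in zip(w_fp, drift)]
delta = [-0.02, 0.03]          # assumed perturbation from quantizing w_latent

# Gradient at the full-precision weights is exactly zero.
grad_at_fp = matvec(H, [0.0, 0.0])

# Gradient at the latent weights is H @ (w_latent - w_fp), which is nonzero.
grad_at_latent = matvec(H, drift)

first_order = dot(grad_at_latent, delta)
second_order = 0.5 * dot(delta, matvec(H, delta))

w_new = [a + b for a, b in zip(w_latent, delta)]
exact_change = loss(H, w_new, w_fp) - loss(H, w_latent, w_fp)

print("first-order term :", first_order)
print("second-order term:", second_order)
print("exact loss change:", exact_change)
```

For a quadratic loss the first- and second-order terms together reproduce the exact loss change, so any gap between the second-order estimate and the exact value here comes entirely from the dropped first-order term.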
