Theory-optimal Quantization Based on Flatness
Xiusheng Huang, Zhe Li, Xuanwu Yin, Lu Wang, Yequan Wang, Dong Li, Emad Barsoum, Kang Liu

TL;DR
This paper introduces a theory-driven quantization method for large language models that minimizes outlier effects, improving accuracy at low bit precision through a novel metric and optimized matrix transformations.
Contribution
It introduces a new metric, Flatness, that quantifies the distribution of outliers; derives the theoretically optimal solution with respect to that metric; and proposes BDQ, a framework that disperses outliers to improve low-bit quantization.
Findings
BDQ achieves less than 1% accuracy drop on LLaMA-3-8B at W4A4 quantization.
BDQ reduces the performance gap by 39.1% on LLaMA-70B in challenging quantization settings.
Theoretical analysis links quantization error to outliers, guiding the design of effective transformations.
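To make the link between outliers and quantization error concrete, here is a minimal Python sketch. It is not taken from the paper: it assumes plain symmetric round-to-nearest 4-bit quantization with a single scale per vector, and the names `quantize_symmetric` and `mse` are hypothetical helpers introduced for illustration. A single large value stretches the scale, so the remaining values are rounded much more coarsely.

```python
import random


def quantize_symmetric(values, bits=4):
    """Symmetric round-to-nearest quantization with one scale per vector.

    Assumption for illustration only; the paper's quantizer may differ.
    Returns the dequantized (reconstructed) values.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit integers
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    reconstructed = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # integer code
        reconstructed.append(q * scale)
    return reconstructed


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
base = [rng.gauss(0.0, 1.0) for _ in range(1024)]

# Same vector, but one entry is replaced by a large outlier.
with_outlier = list(base)
with_outlier[0] = 50.0

err_base = mse(base, quantize_symmetric(base))
err_outlier = mse(with_outlier, quantize_symmetric(with_outlier))

print(f"MSE without outlier: {err_base:.4f}")
print(f"MSE with outlier:    {err_outlier:.4f}")
```

With the outlier present, the quantization step becomes wider than most of the data, so nearly every ordinary value collapses to zero and the error grows sharply. This is the effect that outlier-dispersing transformations aim to reduce.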
Abstract
Post-training quantization has emerged as a widely adopted technique for compressing and accelerating the inference of Large Language Models (LLMs). The primary challenges in LLM quantization stem from activation outliers, which significantly degrade model performance, especially at lower bit precision. While recent approaches attempt to mitigate outliers through linear transformations across feature dimensions, our analysis reveals that the transformed weights and activations still exhibit persistent outlier patterns with concentrated magnitude distributions. In this paper, we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric, Flatness, to quantify the distribution of outliers. Based on this, we derive the theoretical optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal…
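The general idea of a linear transformation across the feature dimension can be sketched as follows. This is an illustrative assumption, not the paper's BDQ method: it uses an orthonormal Walsh-Hadamard rotation as a generic example of such a transformation, and it uses a peak-to-RMS ratio as a simple stand-in for how concentrated the magnitudes are, which is not the paper's Flatness metric (its definition is not given in this summary). The helper names `hadamard_transform`, `peak_to_rms`, and `dot` are hypothetical.

```python
import math
import random


def hadamard_transform(vec):
    """Orthonormal Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b  # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def peak_to_rms(vec):
    """Largest magnitude divided by the root mean square (illustrative proxy)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec))
    return max(abs(v) for v in vec) / rms


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
n = 256
activation = [rng.gauss(0.0, 1.0) for _ in range(n)]
activation[3] = 60.0  # one channel carries a large outlier
weight = [rng.gauss(0.0, 0.1) for _ in range(n)]

# Rotate both operands with the same orthonormal matrix H.
rot_act = hadamard_transform(activation)
rot_w = hadamard_transform(weight)

# Because H is orthonormal, (H x) . (H w) equals x . w up to rounding.
print("output before:", dot(activation, weight))
print("output after: ", dot(rot_act, rot_w))
print("peak/RMS before:", peak_to_rms(activation))
print("peak/RMS after: ", peak_to_rms(rot_act))
```

The rotation leaves the layer output unchanged while spreading the outlier's energy across all channels, so the largest magnitude shrinks relative to the typical one. The abstract's observation is that such transformations can still leave concentrated magnitude patterns, which is what motivates measuring and optimizing Flatness directly.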
