FlatQuant: Flatness Matters for LLM Quantization
Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao

TL;DR
FlatQuant learns an affine transformation for each linear layer that flattens weight and activation distributions before quantization. This reduces the accuracy loss of low-bit quantization (under 1% for W4A4 on LLaMA-3-70B) while still speeding up inference.
Contribution
The paper proposes FlatQuant (Fast and Learnable Affine Transformation), a post-training quantization method that learns an affine transformation for each linear layer to make weights and activations flatter and therefore easier to quantize. It reports higher accuracy than SpinQuant.
Findings
Achieves less than 1% accuracy drop on LLaMA-3-70B with W4A4 quantization.
Surpasses SpinQuant by 7.5% in accuracy in the same W4A4 setting on LLaMA-3-70B.
Provides up to 2.3x prefill speedup and 1.7x decoding speedup.
Abstract
Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead of affine transformation, we apply…
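The flattening idea in the abstract can be illustrated with a small NumPy sketch. It is not FlatQuant's implementation: the learned affine transformation is replaced by a fixed random orthogonal matrix `P` (an assumption made only for illustration), and `fake_quant`, `peak_to_rms`, `rel_err`, the synthetic outlier channel, and all shapes are hypothetical. The sketch shows two things. First, any invertible `P` leaves the full-precision layer output unchanged, because X W^T = (X P)(W P^{-T})^T. Second, spreading an outlier channel across dimensions can lower the error of 4-bit quantization with equally spaced levels.

```python
import numpy as np

def fake_quant(x, bits=4):
    """Symmetric uniform quantization with one scale per row (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard against all-zero rows
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def peak_to_rms(x):
    """Mean per-row ratio of the largest magnitude to the RMS; lower means flatter."""
    return float(np.mean(np.abs(x).max(axis=1) / np.sqrt((x ** 2).mean(axis=1))))

rng = np.random.default_rng(0)
tokens, d_in, d_out = 64, 128, 256                    # hypothetical sizes
X = rng.standard_normal((tokens, d_in))
X[:, 0] *= 20.0                                       # one synthetic outlier channel
W = 0.1 * rng.standard_normal((d_out, d_in))
Y = X @ W.T                                           # full-precision reference output

# Stand-in for a learned transformation: a random orthogonal matrix (assumption).
P, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))
P_inv = np.linalg.inv(P)

X_t = X @ P                                           # transformed activations
W_t = W @ P_inv.T                                     # weights absorb the inverse
Y_t = X_t @ W_t.T                                     # equals Y up to float rounding

def rel_err(y_hat):
    """Relative Frobenius error against the full-precision output."""
    return float(np.linalg.norm(y_hat - Y) / np.linalg.norm(Y))

# W4A4-style simulation: per-token activation scales, per-output-channel weight scales.
err_plain = rel_err(fake_quant(X) @ fake_quant(W).T)
err_flat = rel_err(fake_quant(X_t) @ fake_quant(W_t).T)

print("peak-to-RMS before / after:", peak_to_rms(X), peak_to_rms(X_t))
print("relative W4A4 error before / after:", err_plain, err_flat)
```

In FlatQuant the transformation is learned for each linear layer during calibration rather than drawn at random; the fixed rotation here only demonstrates why flatter distributions quantize with less error.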
