FlatQuant: Flatness Matters for LLM Quantization

Yuxuan Sun; Ruikang Liu; Haoli Bai; Han Bao; Kang Zhao; Yuening Li; Jiaxin Hu; Xianzhi Yu; Lu Hou; Chun Yuan; Xin Jiang; Wulong Liu; Jun Yao

arXiv:2410.09426 · cs.CL · August 12, 2025 · 2 cites

TL;DR

FlatQuant introduces a learnable affine transformation technique to improve LLM quantization by flattening weights and activations, significantly reducing accuracy loss and accelerating inference.

Contribution

The paper proposes FlatQuant, a post-training quantization method that learns an affine transformation for each linear layer to enhance the flatness of weights and activations, outperforming existing methods such as SpinQuant.

Findings

01  Achieves less than a 1% accuracy drop on LLaMA-3-70B with W4A4 quantization.

02  Surpasses SpinQuant by 7.5% in accuracy in the same W4A4 setting.

03  Provides up to 2.3x prefill speedup and 1.7x decoding speedup over the FP16 model.

Abstract

Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead of affine transformation, we apply…
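
The flattening idea in the abstract can be illustrated with a small sketch. It is not FlatQuant itself: a fixed 4x4 Hadamard rotation (one of the prior transformations the abstract mentions) stands in for the learned affine transformation, and the helper names `quantize_int4`, `rotate`, and `mse`, as well as the toy vectors, are assumptions made for illustration. The sketch shows two things: transforming weights and activations with an orthogonal matrix leaves the full-precision layer output unchanged, and spreading an outlier across channels lowers the error of 4-bit quantization with equally spaced points.

```python
# Illustrative sketch only: a fixed Hadamard rotation stands in for the
# learned affine transformation. All names and values are hypothetical.

def quantize_int4(values):
    """Symmetric uniform 4-bit quantize-dequantize with one shared scale."""
    scale = max(abs(v) for v in values) / 7  # signed grid: -7 .. 7
    return [max(-7, min(7, round(v / scale))) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Normalized 4x4 Hadamard matrix: orthogonal and symmetric, so H @ H = I.
H = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]

def rotate(vec):
    """Multiply a vector by H (also the inverse rotation, since H @ H = I)."""
    return [sum(h * v for h, v in zip(row, vec)) for row in H]

x = [0.5, -0.5, 0.5, 8.0]  # activations with one outlier channel
w = [1.0, 2.0, -1.0, 0.5]  # one weight row of a linear layer

# Full precision: rotating both operands preserves the layer output.
y_ref = sum(wi * xi for wi, xi in zip(w, x))
y_rot = sum(wi * xi for wi, xi in zip(rotate(w), rotate(x)))

# Quantization error without and with the flattening rotation.
mse_plain = mse(x, quantize_int4(x))
mse_rotated = mse(x, rotate(quantize_int4(rotate(x))))

print(f"output: {y_ref} vs {y_rot}")
print(f"MSE plain: {mse_plain:.4f}, MSE rotated: {mse_rotated:.4f}")
```

In this toy case the outlier forces a coarse quantization step, so the three small activations all collapse to zero without the rotation. After the rotation every channel has a similar magnitude and the error drops by more than an order of magnitude. A learned transformation, as proposed in the paper, would be fitted per layer on calibration data rather than fixed in advance.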

Code & Models

Repositories

ruikangliu/flatquant
PyTorch · Official

Models

ruikangliu/FlatQuant
Hugging Face model · ♡ 4

Videos

FlatQuant: Flatness Matters for LLM Quantization · SlidesLive
