Theory-optimal Quantization Based on Flatness

Xiusheng Huang; Zhe Li; Xuanwu Yin; Lu Wang; Yequan Wang; Dong Li; Emad Barsoum; Kang Liu

arXiv:2605.18800 · cs.LG · May 20, 2026

TL;DR

This paper introduces a theory-driven quantization method for large language models that minimizes outlier effects, improving accuracy at low bit precision through a novel metric and optimized matrix transformations.

Contribution

It presents a new metric, Flatness, that quantifies how outliers are distributed; derives the theoretically optimal solution with respect to that metric; and proposes BDQ, a framework that disperses outliers to improve model compression.

Findings

01. BDQ achieves less than 1% accuracy drop on LLaMA-3-8B at W4A4 quantization.

02. BDQ reduces the performance gap by 39.1% on LLaMA-70B in challenging quantization settings.

03. Theoretical analysis links quantization error to outliers, guiding the design of effective transformations.
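
The link between outliers and quantization error can be illustrated with a small numerical sketch. It is not the paper's method or its Flatness metric. It assumes plain symmetric per-tensor round-to-nearest quantization at 4 bits, synthetic Gaussian activations, and a single hypothetical outlier channel scaled by 50. The helper names (`fake_quantize`, `peak_to_rms`) and the peak-to-RMS ratio are illustrative stand-ins chosen for this example.

```python
import numpy as np

def fake_quantize(x, bits=4):
    """Symmetric per-tensor round-to-nearest: quantize, then dequantize."""
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit
    scale = np.max(np.abs(x)) / qmax      # a single scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def peak_to_rms(x):
    """Illustrative peakiness proxy; NOT the paper's Flatness metric."""
    return float(np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2)))

rng = np.random.default_rng(0)
clean = rng.normal(size=(64, 128))        # toy activations: 64 tokens, 128 channels
outlier = clean.copy()
outlier[:, 5] *= 50.0                     # assumption: one channel carries large outliers

err_clean = mse(clean, fake_quantize(clean))
err_outlier = mse(outlier, fake_quantize(outlier))

# Fraction of non-outlier entries that collapse to exactly zero after quantization.
others = np.delete(fake_quantize(outlier), 5, axis=1)
zero_fraction = float(np.mean(others == 0.0))

print(f"peak/RMS  clean={peak_to_rms(clean):.2f}  outlier={peak_to_rms(outlier):.2f}")
print(f"MSE       clean={err_clean:.4f}  outlier={err_outlier:.4f}")
print(f"non-outlier entries rounded to zero: {zero_fraction:.0%}")
```

Because the scale is set by the largest magnitude in the tensor, a single outlier channel stretches the quantization grid until the ordinary channels fall below half a step and round to zero. Transformations that spread magnitudes more evenly aim to avoid exactly this loss.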

Abstract

Post-training quantization has emerged as a widely adopted technique for compressing and accelerating the inference of Large Language Models (LLMs). The primary challenges in LLM quantization stem from activation outliers, which significantly degrade model performance, especially at lower bit precision. While recent approaches attempt to mitigate outliers through linear transformations across feature dimensions, our analysis reveals that the transformed weights and activations still exhibit persistent outlier patterns with concentrated magnitude distributions. In this paper, we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric, Flatness, to quantify the distribution of outliers. Based on this, we derive the theoretically optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal…
