Scaling Laws For Mixed Quantization

Zeyu Cao; Boyang Gu; Cheng Zhang; Pedro Gimenes; Jianqiao Lu; Jianyi Cheng; Xitong Gao; Yiren Zhao

arXiv:2410.06722 · cs.CL · August 8, 2025

TL;DR

This paper introduces a unified scaling law for post-training quantization of large language models, predicting how quantization parameters affect accuracy and hardware complexity as models grow larger.

Contribution

It proposes a unified scaling law for post-training quantization, linking quantization ratio and block size to model size and accuracy loss, supported by extensive experiments.

Findings

01

Larger models tolerate higher quantization ratios, enabling more aggressive mixed quantization.

02

Small block sizes are not necessary for large models, simplifying hardware design.

03

The scaling law accurately predicts the loss degradation caused by quantization across different models and quantization methods.

Abstract

Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the memory and computational requirements for inference. In this study, we focus on a straightforward question: When aiming for a target accuracy or perplexity with low-precision quantization, how much high-precision computation needs to be preserved, and how fine-grained this quantization would need to be as we scale LLMs to larger sizes? We first introduce two critical metrics, named the quantization ratio ( $Q_{r}$ ) and quantization block size ( $Q_{b}$ ). The former measures the number of parameters quantized to low-precision arithmetic normalized by the total parameter count, whereas the latter defines the number of values within a block that share a scaling factor, akin to the block size concept introduced in the FP4 format in NVIDIA's Blackwell architecture. Through extensive and carefully…
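
The two metrics defined in the abstract can be made concrete with a small sketch. It is not the paper's implementation: the function names, the symmetric absmax rounding scheme, and the 4-bit default are all assumptions chosen for illustration. The sketch computes $Q_{r}$ as the fraction of parameters held in low precision and treats $Q_{b}$ as the number of consecutive values that share one scaling factor.

```python
from typing import List, Sequence


def quantization_ratio(low_precision_params: int, total_params: int) -> float:
    """Q_r: parameters quantized to low precision, normalized by the total count."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return low_precision_params / total_params


def blockwise_quantize(values: Sequence[float], block_size: int, bits: int = 4) -> List[float]:
    """Quantize and dequantize `values`, sharing one scale per block of Q_b values.

    Assumption: symmetric signed-integer rounding with an absmax scale.
    This is an illustrative scheme, not necessarily the format used in the paper.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    qmax = 2 ** (bits - 1) - 1  # largest representable magnitude, e.g. 7 for 4 bits
    out: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # One scaling factor for the whole block; fall back to 1.0 for an all-zero block.
        scale = absmax / qmax if absmax > 0 else 1.0
        for v in block:
            q = max(-qmax, min(qmax, round(v / scale)))  # clamp to the integer grid
            out.append(q * scale)  # dequantize back to a float
    return out


def mean_squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights: three small values and one outlier.
weights = [0.01, -0.02, 0.03, 8.0]
coarse = blockwise_quantize(weights, block_size=4)  # outlier sets the scale for all four
fine = blockwise_quantize(weights, block_size=2)    # outlier only affects its own block
print(mean_squared_error(weights, fine) < mean_squared_error(weights, coarse))
print(quantization_ratio(low_precision_params=3, total_params=4))
```

In this toy example the outlier dominates the shared scale when the block covers all four values, so the smaller block size gives a lower reconstruction error. Whether that finer granularity remains necessary as models grow is the question the scaling law addresses.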

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling

Methods: Focus