Scaling Law for Quantization-Aware Training
Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo

TL;DR
This paper develops a comprehensive scaling law for quantization-aware training of large language models, revealing how model size, data volume, and quantization granularity influence quantization error, and identifying key error sources.
Contribution
It introduces a unified scaling law for QAT that incorporates training tokens and quantization granularity, and analyzes error components to guide better quantization strategies.
Findings
Quantization error decreases with larger models.
Error increases with more training tokens and coarser quantization.
Activation error in the FC2 layer is the main bottleneck for W4A4 QAT.
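To make "coarser quantization" concrete, here is a minimal sketch of group-wise symmetric 4-bit fake quantization in plain Python. It is an illustration only, not the paper's implementation: the absmax scaling rule, the helper names `fake_quant_groupwise` and `mse`, and the synthetic input with one injected outlier are all assumptions. Each group of `group_size` values shares one scale, so a larger group lets a single outlier degrade the resolution of more values.

```python
import random


def fake_quant_groupwise(values, group_size, bits=4):
    """Quantize then dequantize `values`, sharing one absmax scale per group.

    Hypothetical helper for illustration; symmetric signed integer grid.
    """
    qmax = 2 ** (bits - 1) - 1       # 7 for 4-bit
    qmin = -(2 ** (bits - 1))        # -8 for 4-bit
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        absmax = max(abs(v) for v in group)
        scale = absmax / qmax if absmax > 0 else 1.0  # avoid division by zero
        for v in group:
            q = max(qmin, min(qmax, round(v / scale)))  # round and clamp to the grid
            out.append(q * scale)                        # map back to real values
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic data (assumption): unit Gaussian values plus one large outlier.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(128)]
x[5] = 50.0

err_fine = mse(x, fake_quant_groupwise(x, group_size=32))     # finer granularity
err_coarse = mse(x, fake_quant_groupwise(x, group_size=128))  # coarser granularity
```

With group size 128 the outlier sets the scale for every value, while with group size 32 it only affects its own group, so `err_fine` comes out smaller than `err_coarse` on this input.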
Abstract
Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight…
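The three trends stated in the abstract can be sketched as a multiplicative power law. The functional form below (including the use of log2 of the group size) and every constant in it are assumptions chosen for illustration; the excerpt above does not give the fitted formula or its coefficients.

```python
import math


def qat_error(n_params, n_tokens, group_size,
              k=1.0, gamma_n=0.5, gamma_d=0.1, gamma_g=0.3):
    """Hypothetical quantization-error model:

        delta(N, D, G) = k * D**gamma_d * log2(G)**gamma_g / N**gamma_n

    All default constants are placeholders, not fitted values.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so that log2(G) > 0")
    return (k * n_tokens ** gamma_d
            * math.log2(group_size) ** gamma_g
            / n_params ** gamma_n)


# Direction of each effect under the placeholder exponents:
small_model = qat_error(74e6, 1e11, 128)
large_model = qat_error(595e6, 1e11, 128)   # larger N, lower error
more_tokens = qat_error(74e6, 1e12, 128)    # larger D, higher error
finer_group = qat_error(74e6, 1e11, 32)     # smaller G, lower error
```

Any positive exponents reproduce the qualitative behavior described in the abstract; only a fit to measured loss gaps would give meaningful values.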
Peer Reviews
Decision: Submitted to ICLR 2026
- Significance: The work addresses a critical and timely problem. As W4A4 QAT becomes essential for efficient LLM deployment, understanding its scaling behavior is of high practical importance, yet it remains poorly understood.
- Originality: The primary originality lies in the formulation of a scaling law that includes the training data volume (D) and quantization granularity (G). The finding that quantization error increases with D is a novel and non-trivial observation that challenges common
- Limited Generalizability Due to Model Scale: The paper's most significant weakness is the gap between its claims ("Large Language Models") and its experimental setup. The scaling law is derived from models ranging from 74M to 595M parameters, with validation on a 973M model. These models are orders of magnitude smaller than current state-of-the-art LLMs (e.g., 7B, 70B, 100B+). Scaling laws are only valuable if they extrapolate, and there is no evidence that a trend observed in sub-1B models wi
1. The unified scaling law feels intuitive yet is backed by solid data; the inclusion of both data and granularity terms makes sense.
2. Clear identification of the FC2 activation bottleneck; the mixed-precision fix is simple and convincing.
3. This paper is a neat empirical work that connects scaling laws and quantization in a meaningful, practically useful way.
1. Experiments stop at sub-1B dense models; unclear if the scaling law still holds for >10B or MoE setups.
2. Mostly focuses on W4A4; doesn’t explore ternary or mixed-bit cases that recent works care about.
3. While empirical results are strong, the practical takeaways for real deployment (beyond FC2 mixed precision) could be discussed more.
- Presents the first unified QAT scaling law incorporating model size (N), dataset size (D), and quantization granularity (G).
- Provides clear empirical validation with extensive experiments and well-fitted results.
- Offers useful diagnostic insights, including weight vs. activation error decomposition and identification of the FC2 activation bottleneck.
- The definition of quantization error differs from conventional quantization studies. In this paper, the “quantization error” is defined as the final training loss gap rather than the difference between quantized and full-precision parameters or inference performance degradation. There is no clear rationale for this definition, and the paper does not show whether a smaller loss gap actually correlates with better downstream performance when compared to BF16-trained counterparts. This makes it d
Taxonomy
Topics: Neural Networks and Applications
