Scaling Laws for Floating Point Quantization Training

Xingwu Sun; Shuaipeng Li; Ruobing Xie; Weidong Han; Kan Wu; Zhen Yang; Yixing Li; An Wang; Shuai Li; Jinbao Xue; Yu Cheng; Yangyu Tao; Zhanhui Kang; Chengzhong Xu; Di Wang; Jie Jiang

arXiv:2501.02423·cs.LG·June 5, 2025

TL;DR

This paper develops a comprehensive scaling law for floating-point quantization in large language model training, revealing optimal bit configurations and data size effects to improve efficiency and performance.

Contribution

It introduces an accurate unified scaling law for FP quantization in LLM training and offers practical guidelines for hardware and training optimization.

Findings

01 Exponent bits are slightly more impactful than mantissa bits

02 A critical data size limits low-precision training; beyond it, performance degrades

03 Optimal FP quantization precision lies between 4 and 8 bits

Abstract

Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization and pay less attention to the constituents of floating-point (FP) quantization, so they cannot fit LLM losses well in this scenario. Meanwhile, although FP quantization training is more commonly implemented in production, its research has been relatively superficial. In this paper, we thoroughly explore the effects of FP quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the FP quantization training performance of LLMs. In addition to an accurate unified scaling law for FP quantization, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to the model performance than mantissa bits. We…
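To make the quantities named in the abstract concrete, here is a minimal sketch of simulated FP quantization with a configurable number of exponent bits, mantissa bits, and scaling-factor granularity. It is an illustration only, not the paper's implementation: the function names (fp_max, fp_quantize, quantize_with_scale) are hypothetical, the toy format reserves no codes for inf/NaN (so its E4M3 maximum is 480, unlike hardware formats that reserve some codes), and the scale is a plain absolute-maximum scale.

```python
import math


def fp_max(exp_bits, man_bits):
    """Largest magnitude of the toy format (assumption: no inf/NaN codes)."""
    bias = 2 ** (exp_bits - 1) - 1
    e_max = (2 ** exp_bits - 1) - bias
    return (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max


def fp_quantize(x, exp_bits, man_bits):
    """Round x to the nearest value of a signed toy float format, saturating."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    e_min = 1 - bias                      # smallest normal exponent
    max_val = fp_max(exp_bits, man_bits)
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_val)              # saturate instead of overflowing
    _, ex = math.frexp(a)                 # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, e_min)                # subnormals share the exponent e_min
    step = 2.0 ** (e - man_bits)          # spacing of representable values
    return sign * min(round(a / step) * step, max_val)


def quantize_with_scale(values, exp_bits, man_bits, block_size=None):
    """Quantize a list with one abs-max scale per block.

    block_size=None uses a single scale for the whole list (per-tensor);
    a smaller block_size gives a finer scaling-factor granularity.
    """
    block_size = block_size or len(values)
    max_val = fp_max(exp_bits, man_bits)
    out = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / max_val if amax > 0 else 1.0
        out.extend(fp_quantize(v / scale, exp_bits, man_bits) * scale
                   for v in block)
    return out


if __name__ == "__main__":
    print(fp_quantize(1.1, exp_bits=4, man_bits=3))   # 1.125
    data = [0.001, 0.002, 100.0, 200.0]
    print(quantize_with_scale(data, 2, 1))                 # per-tensor scale
    print(quantize_with_scale(data, 2, 1, block_size=2))   # per-block scale
```

With a single per-tensor scale, the two small entries in the example collapse to zero under a 2-exponent-bit, 1-mantissa-bit format, while a per-block scale preserves them. This is one intuition for why the granularity of the scaling factor is studied alongside the bit allocation.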

Taxonomy

Topics: Photonic and Optical Devices · Digital Filter Design and Implementation

Methods: Softmax · Attention Is All You Need · Focus