The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

Hengjie Cao; Zhendong Huang; Mengyi Chen; Yifeng Yang; Fanqi Yu; Ruijun Huang; Fang Dong; Xin Zhang; Jixian Zhou; Anrui Chen; Mingzhi Dong; Yujiang Wang; Jinlong Hou; Qin Lv; Yuan Cheng; Tun Lu; Fan Yang; Li Shang

arXiv:2603.10444 · cs.LG · March 12, 2026

Open Access

TL;DR

This paper identifies a rank-one mean bias as the main cause of instability in low-bit quantized LLM training and proposes a simple mean subtraction method to improve stability and performance.

Contribution

The paper shows that mean bias is the dominant component of spectral anisotropy in LLM representations and introduces a simple mean-subtraction technique that stabilizes low-bit LLM training.

Findings

01. Mean bias accounts for most spectral anisotropy in LLMs.

02. Mean subtraction significantly improves low-bit training stability.

03. Performance approaches BF16 accuracy after bias removal.

Abstract

Large language models trained on natural language exhibit pronounced anisotropy: a small number of directions concentrate disproportionate energy, while the remaining dimensions form a broad semantic tail. In low-bit training regimes, this geometry becomes numerically unstable. Because blockwise quantization scales are determined by extreme elementwise magnitudes, dominant directions stretch the dynamic range, compressing long-tail semantic variation into narrow numerical bins. We show that this instability is primarily driven by a coherent rank-one mean bias, which constitutes the dominant component of spectral anisotropy in LLM representations. This mean component emerges systematically across layers and training stages and accounts for the majority of extreme activation magnitudes, making it the principal driver of dynamic-range inflation under low precision. Crucially, because the…
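The dynamic-range argument in the abstract can be illustrated with a small numerical sketch. It is not the paper's implementation. It assumes a toy activation matrix built from a shared per-feature mean plus small token-specific noise, a simplified per-block abs-max scaling onto an E2M1-style 4-bit grid, and a mean taken over the token axis that is restored in full precision after quantization. All function names, sizes, and noise levels are hypothetical.

```python
import random

# Magnitudes of an E2M1-style 4-bit float grid (assumed here for illustration).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Fake-quantize one block: scale by its abs-max, snap to the grid, rescale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / FP4_GRID[-1]  # the largest element maps to the top grid value
    out = []
    for v in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out


def quantize_rows(x, block_size):
    """Apply blockwise fake-quantization along the feature axis of every row."""
    return [
        [q for i in range(0, len(row), block_size)
         for q in quantize_block(row[i:i + block_size])]
        for row in x
    ]


def column_mean(x):
    """Mean over the token axis: one value per feature."""
    n = len(x)
    return [sum(row[j] for row in x) / n for j in range(len(x[0]))]


def mse(a, b):
    """Mean squared error between two equally shaped nested lists."""
    total = sum((u - v) ** 2 for ra, rb in zip(a, b) for u, v in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def demo(seed=0, n_tokens=64, dim=32, block_size=16):
    rng = random.Random(seed)
    # Toy data (assumption): a large mean shared by all tokens plus small noise.
    mu = [rng.gauss(0.0, 5.0) for _ in range(dim)]
    x = [[mu[j] + rng.gauss(0.0, 0.1) for j in range(dim)] for _ in range(n_tokens)]

    # Path 1: quantize the raw activations; the mean sets every block scale.
    direct = quantize_rows(x, block_size)

    # Path 2: subtract the token-axis mean, quantize the residual,
    # then add the mean back in full precision (assumption of this sketch).
    m = column_mean(x)
    centered = [[row[j] - m[j] for j in range(dim)] for row in x]
    restored = [[q + m[j] for j, q in enumerate(row)]
                for row in quantize_rows(centered, block_size)]

    return mse(x, direct), mse(x, restored)


if __name__ == "__main__":
    err_direct, err_centered = demo()
    print("MSE without mean subtraction:", err_direct)
    print("MSE with mean subtraction:   ", err_centered)
```

In this toy setup the shared mean dominates each block's abs-max, so the small token-specific variation is rounded on a coarse grid. Once the mean is removed, the block scales shrink to the size of the residual and the same 4-bit grid resolves it much more finely.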

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis