Dissecting Outlier Dynamics in LLM NVFP4 Pretraining
Peijie Dong, Ruibo Fan, Yuechen Tao, Di Mou, Wenhu Hu, Zhenheng Tang, Yinghao Yu, Jiamang Wang, Wenbo Su, Guodong Yang, Liping Zhang, Xiaowen Chu, Baochun Li, and Bo Li

TL;DR
This paper analyzes outlier behaviors during NVFP4 pretraining of large language models, identifying architectural sources of outliers and proposing a novel online compensation method to improve training accuracy.
Contribution
The paper provides a longitudinal analysis of outlier dynamics in NVFP4 training and introduces HCP and CHON to mitigate outlier effects and narrow the loss gap to BF16.
Findings
Outliers evolve from transient spikes to persistent hot channels during training.
Linear Attention reduces heavy tails but still exhibits block-level spikes.
CHON training recipe significantly reduces the loss gap to BF16.
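The distinction between per-tensor heavy tails and block-level spikes can be illustrated with a small sketch. The two metrics below (excess kurtosis over the whole tensor, and a per-block peak-to-mean ratio) are assumptions chosen for illustration, not necessarily the statistics used in the paper. The helper names and the toy tensor are hypothetical.

```python
def excess_kurtosis(values):
    """Excess kurtosis of a flat list; values above 0 indicate heavier tails than a Gaussian."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


def block_peak_ratios(values, block=16):
    """For each block, the ratio of the largest magnitude to the mean magnitude."""
    ratios = []
    for i in range(0, len(values), block):
        chunk = [abs(v) for v in values[i:i + block]]
        mean_abs = sum(chunk) / len(chunk)
        ratios.append(max(chunk) / mean_abs if mean_abs > 0 else 1.0)
    return ratios


# Toy tensor (hypothetical): 15 blocks of +/-1 values, then one block of
# tiny values containing a single element of ordinary magnitude.
tensor = [1.0 if i % 2 == 0 else -1.0 for i in range(15 * 16)]
tensor += [0.01] * 15 + [1.0]

global_kurtosis = excess_kurtosis(tensor)
ratios = block_peak_ratios(tensor)
print("excess kurtosis:", global_kurtosis)
print("largest block peak ratio:", max(ratios))
```

In this toy case the tensor-wide statistic shows no heavy tail, because the spike is no larger than typical values elsewhere, yet inside its own block it dominates the scale. That is the kind of situation in which block quantization can still suffer even when per-tensor tails look benign.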
Abstract
Training large language models using 4-bit arithmetic enhances throughput and memory efficiency. Yet, the limited dynamic range of FP4 increases sensitivity to outliers. While NVFP4 mitigates quantization error via hierarchical microscaling, a persistent loss gap remains compared to BF16. This study conducts a longitudinal analysis of outlier dynamics across architectures during NVFP4 pretraining, focusing on where they localize, why they occur, and how they evolve temporally. We find that, compared with Softmax Attention (SA), Linear Attention (LA) reduces per-tensor heavy tails but still exhibits persistent block-level spikes under block quantization. Our analysis attributes outliers to specific architectural components: Softmax in SA, gating in LA, and SwiGLU in FFN, with "post-QK" operations exhibiting higher sensitivity to quantization. Notably, outliers evolve from transient spikes…
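To make the sensitivity of block quantization to outliers concrete, here is a minimal fake-quantization sketch. It assumes the E2M1 value grid and the 16-element block size of NVFP4, but keeps each block scale in full floating-point precision rather than rounding it to FP8 and omits the per-tensor scale, so it is a simplification and not the paper's or NVIDIA's implementation. All function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element; the sign is a separate bit.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # elements sharing one scale


def quantize_block(block):
    """Scale so the largest magnitude maps to 6, round to the grid, rescale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / FP4_GRID[-1]  # simplification: scale is not rounded to FP8
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def quantize(values, block=BLOCK):
    """Fake-quantize a flat list block by block."""
    out = []
    for i in range(0, len(values), block):
        out.extend(quantize_block(values[i:i + block]))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# One block of small values, with and without a single large outlier.
clean = [0.1 * ((i % 7) - 3) for i in range(BLOCK)]
spiked = list(clean)
spiked[5] = 50.0

q_clean = quantize(clean)
q_spiked = quantize(spiked)

# Error on the 15 ordinary elements only, excluding the outlier position.
rest = [i for i in range(BLOCK) if i != 5]
err_clean = mse([clean[i] for i in rest], [q_clean[i] for i in rest])
err_spiked = mse([spiked[i] for i in rest], [q_spiked[i] for i in rest])
print("error without outlier:", err_clean)
print("error with outlier:   ", err_spiked)
```

Because the outlier sets the shared scale, the remaining elements of its block fall below the smallest nonzero grid step and are rounded to zero. The outlier itself is preserved, but the information in its neighbours is lost, which is one way a block-level spike can hurt even under fine-grained scaling.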
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Natural Language Processing Techniques
