HiFloat4 Format for Language Model Pre-training on Ascend NPUs
Mehran Taghian, Yunke Peng, Xing Huang, Yao Wang, Yaoyuan Wang, Wei Guo, Yuanyong Luo, Tianchi Hu, Junsong Wang, Xin Wang, Hu Liu, Yu Cheng, Ziwei Yu, Hongliang Li, Mehdi Rahimifar, Lei Yan, Xuefei Wang, Zhuang Ma, Lei Liu, Hui Yu, Anandharaju Durai Raju, Hoang Le, Hei Yi Mak

TL;DR
This paper evaluates the HiFloat4 FP4 format for efficient large-scale language model training on Huawei Ascend NPUs, comparing it with other FP4 formats and exploring stabilization techniques.
Contribution
It provides a comprehensive empirical comparison of HiFloat4 with MXFP4 for large-scale training on NPUs, including stabilization methods for FP4.
Findings
With stabilization, HiFloat4 achieves accuracy comparable to full-precision training.
FP4 formats can improve compute throughput and memory efficiency by up to 4x over higher-precision baselines.
Stabilization techniques reduce numerical errors to within 1% of full-precision baselines.
Abstract
Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats, such as MXFP4 and NVFP4, can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines. In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision. We…
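To make the MXFP4 baseline mentioned in the abstract concrete, the following is a minimal sketch of simulated (quantize then dequantize) MXFP4 rounding in plain Python. It assumes the layout from the OCP Microscaling specification: blocks of 32 elements, E2M1 element values, and one shared power-of-two scale per block. The names `quantize_block` and `quantize_mxfp4` are hypothetical, ties are resolved toward the smaller magnitude rather than to nearest-even, and the exponent range of the shared scale is not clamped. This is not the paper's implementation, and it does not model HiFloat4, whose encoding is not described in this summary.

```python
import math

# Magnitudes representable by an E2M1 (FP4) element.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # elements sharing one scale in MXFP4
E2M1_EMAX = 2    # exponent of the largest E2M1 value (6 = 1.5 * 2**2)


def quantize_block(block):
    """Simulate MXFP4 quantize-dequantize for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared power-of-two scale chosen from the block's largest magnitude.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for x in block:
        # Saturate to the largest E2M1 magnitude, then snap to the grid.
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return out


def quantize_mxfp4(values):
    """Apply block-wise simulated MXFP4 rounding to a flat list."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(quantize_block(values[start:start + BLOCK_SIZE]))
    return out
```

Because the scale is shared, one large value in a block coarsens the grid for every other element in that block: in `quantize_block([1.0, 0.3])` the scale is 0.25, so 0.3 is rounded to 0.25.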
