HiFloat4 Format for Language Model Pre-training on Ascend NPUs

Mehran Taghian; Yunke Peng; Xing Huang; Yao Wang; Yaoyuan Wang; Wei Guo; Yuanyong Luo; Tianchi Hu; Junsong Wang; Xin Wang; Hu Liu; Yu Cheng; Ziwei Yu; Hongliang Li; Mehdi Rahimifar; Lei Yan; Xuefei Wang; Zhuang Ma; Lei Liu; Hui Yu; Anandharaju Durai Raju; Hoang Le; Hei Yi Mak; Tanzila Rahman; Shadan Golestan

arXiv:2604.08826·cs.LG·April 13, 2026

TL;DR

This paper evaluates the HiFloat4 FP4 format for large-scale language model training on Huawei Ascend NPUs, comparing it with MXFP4 and exploring stabilization techniques for FP4 training.

Contribution

The paper provides an empirical comparison of HiFloat4 and MXFP4 for large-scale training on Ascend NPUs, together with stabilization methods for FP4 training.

Findings

01

With stabilization, HiFloat4 achieves accuracy comparable to full-precision training.

02

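FP4 formats can improve compute throughput and memory efficiency by up to 4x over higher-precision baselines, a figure the abstract attributes to recent work on MXFP4 and NVFP4.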

03

Stabilization techniques reduce numerical errors to within 1% of full-precision baselines.

Abstract

Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats, such as MXFP4 and NVFP4, can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines. In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision. We…
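
To make the MXFP4 baseline mentioned in the abstract concrete, here is a minimal sketch of block-scaled FP4 "fake quantization" in plain Python. It assumes the publicly documented MXFP4 layout (E2M1 elements, blocks of 32 values sharing one power-of-two scale). It does not implement HiFloat4, whose encoding is not described on this page, and it is not the authors' code. The function names `quantize_block` and `quantize_mxfp4` are hypothetical, and the tie-breaking rule is a simplification.

```python
import math

# Magnitudes representable by an E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # MXFP4 shares one scale across 32 elements
E2M1_EMAX = 2    # exponent of the largest E2M1 value (6 = 1.5 * 2**2)


def quantize_block(block):
    """Quantize one block to E2M1 and dequantize it again.

    Returns (shared_scale, dequantized_values). Hypothetical helper.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared power-of-two scale derived from the block's largest magnitude.
    # (Clamping the exponent to the 8-bit scale range is omitted here.)
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for x in block:
        # Saturate anything above the largest representable magnitude.
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        # Nearest representable value; ties go to the smaller magnitude
        # (a simplification of round-to-nearest-even).
        q = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(q, x) * scale)
    return scale, out


def quantize_mxfp4(values):
    """Apply quantize_block to consecutive blocks of a flat list."""
    out = []
    for i in range(0, len(values), BLOCK_SIZE):
        _, deq = quantize_block(values[i:i + BLOCK_SIZE])
        out.extend(deq)
    return out
```

Under these assumptions each element costs 4 bits plus an 8-bit scale amortized over 32 elements, or 4.25 bits per value, which is where the roughly 4x memory saving relative to 16-bit formats comes from.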
