Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats
Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin, Zhiyuan Yang, Manyi Zhang, Yuanyong Luo, Ziwei Yu, Xin Wang, Mingxuan Yuan, Xianzhi Yu, Zhenhua Dong

TL;DR
This paper evaluates HiFloat low-bit floating-point formats on Ascend NPUs, demonstrating their advantages in efficiency and accuracy for large language model inference across various data types and tasks.
Contribution
It introduces and comprehensively evaluates HiFloat formats tailored for NPUs, highlighting their compatibility with quantization frameworks and their effectiveness in different data regimes.
Findings
INT8 suits narrow-range data, floating-point excels with high-variance data
HiF4's hierarchical scaling prevents accuracy collapse in 4-bit regimes
HiFloat is compatible with state-of-the-art quantization frameworks
Abstract
As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4's hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Numerical Methods and Algorithms
