Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

Pengxiang Zhao; Hui-Ling Zhen; Xing Li; Han Bao; Weizhe Lin; Zhiyuan Yang; Manyi Zhang; Yuanyong Luo; Ziwei Yu; Xin Wang; Mingxuan Yuan; Xianzhi Yu; Zhenhua Dong

arXiv:2602.12635 · cs.CL · March 3, 2026

Open Access

TL;DR

This paper evaluates the HiFloat low-bit floating-point formats (HiF8 and HiF4) on Ascend NPUs for large language model inference, comparing them with integer and other low-bit floating-point formats on weight-activation and KV-cache quantization tasks.

Contribution

The paper comprehensively evaluates the HiFloat formats (HiF8 and HiF4), which are tailored for Ascend NPUs. It identifies which data regimes favor integer versus floating-point formats and shows that HiFloat is compatible with state-of-the-art post-training quantization frameworks.

Findings

01. INT8 suits narrow-range data, while floating-point formats excel with high-variance data

02. HiF4's hierarchical scaling prevents accuracy collapse in 4-bit regimes

03. HiFloat is compatible with state-of-the-art quantization frameworks
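
The first finding can be illustrated with a small sketch. It is not taken from the paper and does not implement HiF8: `round_to_toy_float` is a hypothetical 8-bit-style float with 3 mantissa bits and an assumed exponent range, and both quantizers use a single per-tensor scale. Under those assumptions, the uniform INT8 grid should give lower error on narrow-range data, while the float grid should give lower error once a few large outliers stretch the range.

```python
import math
import random

def quantize_int8(values):
    """Symmetric per-tensor INT8: one scale derived from the absolute maximum."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def round_to_toy_float(x, mant_bits=3, min_exp=-6, max_exp=8):
    """Round x to a toy low-bit float grid (illustrative only, NOT HiF8)."""
    if x == 0.0:
        return 0.0
    # abs(x) = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    _, e = math.frexp(abs(x))
    exp = max(min_exp, min(max_exp, e - 1))
    step = 2.0 ** (exp - mant_bits)          # grid spacing inside this binade
    max_val = (2.0 - 2.0 ** -mant_bits) * 2.0 ** max_exp
    q = min(round(abs(x) / step) * step, max_val)
    return math.copysign(q, x)

def quantize_toy_float(values, mant_bits=3, min_exp=-6, max_exp=8):
    """Per-tensor scaling so the absolute maximum maps to the largest toy-float value."""
    amax = max(abs(v) for v in values) or 1.0
    max_val = (2.0 - 2.0 ** -mant_bits) * 2.0 ** max_exp
    scale = amax / max_val
    return [round_to_toy_float(v / scale, mant_bits, min_exp, max_exp) * scale
            for v in values]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Narrow-range data: uniform values in [-1, 1].
narrow = [random.uniform(-1.0, 1.0) for _ in range(4096)]
# High-variance data: Gaussian bulk plus a handful of large outliers.
heavy = [random.gauss(0.0, 1.0) for _ in range(4096)]
for i in range(0, 4096, 1024):
    heavy[i] = 100.0 if i % 2048 == 0 else -100.0

for name, data in (("narrow", narrow), ("heavy-tailed", heavy)):
    print(name,
          "| INT8 MSE:", mse(data, quantize_int8(data)),
          "| toy float MSE:", mse(data, quantize_toy_float(data)))
```

The intuition is that INT8 spends its 255 levels evenly, which is ideal when values fill the range, whereas a floating-point grid keeps roughly constant relative error, so small values survive even when outliers set the scale.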

Abstract

As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4's hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.
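
To make the idea of hierarchical scaling in insight (2) more concrete, here is a minimal sketch of a generic two-level scheme. It is an assumption-laden illustration, not the HiF4 bit layout: the group size of 64, the sub-block size of 8, the 2-bit power-of-two shift, and the function names are all hypothetical. It compares 4-bit integer quantization with one scale for the whole vector against a version where each group has its own scale and each sub-block refines it with a shift.

```python
import random

def quantize_int4_flat(values):
    """Symmetric 4-bit integer quantization with a single scale for the whole vector."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 7.0
    return [max(-7, min(7, round(v / scale))) * scale for v in values]

def quantize_int4_two_level(values, group=64, sub=8, max_shift=3):
    """Toy two-level scaling (illustrative only, NOT the HiF4 layout).

    Level 1: one scale per group of `group` elements.
    Level 2: one power-of-two shift per sub-block of `sub` elements,
             which shrinks the step size when the sub-block has small values.
    """
    out = []
    for g in range(0, len(values), group):
        chunk = values[g:g + group]
        g_amax = max(abs(v) for v in chunk) or 1.0
        for s in range(0, len(chunk), sub):
            block = chunk[s:s + sub]
            b_amax = max(abs(v) for v in block)
            # Largest shift for which the sub-block still fits without clipping.
            shift = 0
            while shift < max_shift and b_amax * 2 ** (shift + 1) <= g_amax:
                shift += 1
            scale = g_amax / 7.0 / 2 ** shift
            out.extend(max(-7, min(7, round(v / scale))) * scale for v in block)
    return out

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Gaussian bulk with two large outliers that dominate a single global scale.
data = [random.gauss(0.0, 1.0) for _ in range(1024)]
data[0], data[512] = 50.0, -50.0

flat_err = mse(data, quantize_int4_flat(data))
two_level_err = mse(data, quantize_int4_two_level(data))
print("flat INT4 MSE:", flat_err)
print("two-level 4-bit MSE:", two_level_err)
```

With a single scale, the outliers force a step so coarse that most of the bulk rounds to zero, which is one way a 4-bit integer format can collapse. Local scales confine that damage to the few sub-blocks that actually contain an outlier.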

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Numerical Methods and Algorithms