SGEMM-cube: Precision-Recovery FP32 GEMM Approximation on Ascend NPUs with FP16 Matrix Engines

Weicheng Xue; Baisong Xu; Kai Yang; Yongxiang Liu; Dengdeng Fan; Pengxiang Xu; Yonghong Tian

arXiv:2507.23387·cs.DC·May 7, 2026

TL;DR

SGEMM-cube approximates FP32 matrix multiplication on Ascend NPUs using FP16 Cube units, targeting near-FP32 accuracy (not bit-exact results) while retaining high throughput.

Contribution

The paper contributes an architecture-specific realization of two-component FP32-to-FP16 splitting, with detailed analysis and optimization for Ascend NPUs; it does not propose a new splitting paradigm.

Findings

01 Achieves up to 65.3 TFLOP/s on Ascend 910A.

02 Recovers substantially higher accuracy than native FP16 GEMM.

03 Approaches FP32 GEMM accuracy for inputs whose magnitudes are representable within the FP16 dynamic range.

Abstract

Modern AI accelerators provide high-throughput low-precision matrix engines, but their support for FP32 GEMM is often limited or inefficient. This work presents SGEMM-cube, a precision-recovery FP32 GEMM approximation on Ascend NPUs using FP16 Cube units. Rather than claiming bit-exact FP32 approximation, SGEMM-cube targets near-FP32 accuracy for inputs whose magnitudes are representable within the FP16 dynamic range. The method follows a two-component FP32-to-FP16 splitting strategy related to Ozaki-style and Ootomo-style schemes: each FP32 operand is represented by an FP16 high component and a scaled FP16 residual component, and the matrix product is reconstructed from the dominant high-high and high-low terms while omitting the low-low term. The main contribution of this paper is not a new splitting paradigm, but an architecture-specific realization and analysis of this…
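
The splitting strategy described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it runs on the CPU using only the Python standard library, and the residual scaling factor `SCALE = 2**11`, the helper names (`to_fp16`, `split`, `split_gemm`, `fp16_gemm`, `demo`), and the double-precision accumulation are all assumptions chosen for illustration. Each FP32 value x is written as x ≈ hi + lo / SCALE with hi and lo both in FP16, and the product is rebuilt from the high-high and high-low terms only, with the low-low term dropped.

```python
import random
import struct

SCALE = 2.0 ** 11  # assumed residual scaling factor; the paper's value may differ


def to_fp16(x):
    """Round x to the nearest IEEE 754 binary16 value, returned as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round x to the nearest IEEE 754 binary32 value, returned as a float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def split(x):
    """Split an FP32 value into an FP16 high part and a scaled FP16 residual."""
    hi = to_fp16(x)
    lo = to_fp16((x - hi) * SCALE)  # scaling lifts the tiny residual into FP16 range
    return hi, lo


def matmul(a, b):
    """Plain triple-loop product; accumulates in Python double precision
    (a simplification, not a model of the hardware accumulator)."""
    k, m = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(m)]
            for row in a]


def split_matrix(mat):
    """Apply split() elementwise and return (high matrix, low matrix)."""
    pairs = [[split(x) for x in row] for row in mat]
    hi = [[p[0] for p in row] for row in pairs]
    lo = [[p[1] for p in row] for row in pairs]
    return hi, lo


def split_gemm(a, b):
    """Approximate A @ B from high-high and high-low terms; low-low is omitted."""
    a_hi, a_lo = split_matrix(a)
    b_hi, b_lo = split_matrix(b)
    hh = matmul(a_hi, b_hi)
    hl = matmul(a_hi, b_lo)
    lh = matmul(a_lo, b_hi)
    return [[hh[i][j] + (hl[i][j] + lh[i][j]) / SCALE
             for j in range(len(hh[0]))] for i in range(len(hh))]


def fp16_gemm(a, b):
    """Baseline: round both operands to FP16 and multiply, with no residual."""
    a16 = [[to_fp16(x) for x in row] for row in a]
    b16 = [[to_fp16(x) for x in row] for row in b]
    return matmul(a16, b16)


def max_abs_err(c, ref):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(x - y) for rc, rr in zip(c, ref) for x, y in zip(rc, rr))


def demo(n=8, seed=0):
    """Return (FP16-only error, split error) against a double-precision reference."""
    rng = random.Random(seed)
    a = [[to_fp32(rng.uniform(-1.0, 1.0)) for _ in range(n)] for _ in range(n)]
    b = [[to_fp32(rng.uniform(-1.0, 1.0)) for _ in range(n)] for _ in range(n)]
    ref = matmul(a, b)
    return max_abs_err(fp16_gemm(a, b), ref), max_abs_err(split_gemm(a, b), ref)


if __name__ == "__main__":
    fp16_err, split_err = demo()
    print(f"FP16-only max abs error: {fp16_err:.3e}")
    print(f"Split     max abs error: {split_err:.3e}")
```

The inputs in `demo` are drawn from [-1, 1], well inside the FP16 dynamic range, which matches the regime the abstract targets; values beyond the FP16 range would make `struct.pack("<e", ...)` raise `OverflowError` in this sketch.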
