Accurate Models of NVIDIA Tensor Cores

Faizan A. Khattak; Mantas Mikaitis

arXiv:2512.07004·cs.MS·April 7, 2026

TL;DR

This paper develops software models that accurately emulate the numerical behavior of NVIDIA Tensor Cores across several GPU architectures and input formats. The motivation is that these matrix multipliers do not comply with IEEE 754 floating-point arithmetic, so results can differ between GPU generations.

Contribution

The paper introduces detailed software models of the low- and mixed-precision matrix multipliers found on multiple NVIDIA GPU architectures, so that their numerical behavior can be reproduced and analyzed in software.

Findings

01. Models replicate the rounding and normalization behavior of the hardware.

02. Models cover 8-, 16-, and 19-bit floating-point formats.

03. Models enable consistent testing across different GPU generations.
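
To make the interplay between format width and rounding mode concrete, here is a minimal Python sketch that rounds a double to a given number of significand bits, either to nearest (ties to even) or toward zero. It is an illustration only, not the paper's model: the function name `round_to_precision` and the `FORMATS` table are hypothetical, the format names are the commonly used 8-, 16-, and 19-bit formats (assumed here, since the summary does not name them), and exponent range, subnormals, and overflow are deliberately ignored.

```python
import math

# Significand precision in bits, counting the implicit leading bit.
# Assumption: these are the usual formats of each width; the summary above
# only gives the widths (8, 16, 19 bits), not the format names.
FORMATS = {
    "fp8_e5m2": 3,   # 8-bit:  1 sign, 5 exponent, 2 fraction bits
    "fp8_e4m3": 4,   # 8-bit:  1 sign, 4 exponent, 3 fraction bits
    "bf16": 8,       # 16-bit: 1 sign, 8 exponent, 7 fraction bits
    "fp16": 11,      # 16-bit: 1 sign, 5 exponent, 10 fraction bits
    "tf32": 11,      # 19-bit: 1 sign, 8 exponent, 10 fraction bits
}


def round_to_precision(x, p, mode="nearest_even"):
    """Round x to p significand bits (hypothetical helper).

    Only the significand is modelled: exponent range, subnormals and
    overflow of the target format are ignored.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= |m| < 1
    scaled = math.ldexp(m, p)     # exact; |scaled| lies in [2**(p-1), 2**p)
    if mode == "nearest_even":
        q = round(scaled)         # Python rounds halfway cases to even
    elif mode == "toward_zero":
        q = math.trunc(scaled)    # drop the discarded bits
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return math.ldexp(q, e - p)


x = 1 + 3 * 2**-11                # needs 12 significand bits
print(round_to_precision(x, FORMATS["fp16"]))                      # 1.001953125
print(round_to_precision(x, FORMATS["fp16"], mode="toward_zero"))  # 1.0009765625
```

The same input gives two different 11-bit results depending on the rounding mode, which is the kind of difference a hardware model has to reproduce exactly.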

Abstract

Matrix multiplication is a fundamental operation in both the training of neural networks and inference. To accelerate matrix multiplication, Graphics Processing Units (GPUs) implement it in hardware. Because of their higher throughput compared with software-based matrix multiplication, these multipliers are increasingly used outside of AI to accelerate various applications in scientific computing. However, matrix multipliers targeted at AI are at present not compliant with IEEE 754 floating-point arithmetic behaviour, and different vendors offer different numerical features. This leads to results that are not reproducible across different generations of GPU architectures at the level of the matrix multiply-accumulate instruction. To study numerical characteristics of matrix multipliers -- such as rounding behaviour, accumulator width, normalization points, extra carry bits, and others -- test…
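
As a rough illustration of why accumulator width matters, the following Python sketch adds several terms the way a fixed-width multi-term adder might: every term is aligned to the largest exponent, bits below the kept width are truncated, and the survivors are summed exactly. This is a toy model under stated assumptions, not the behaviour of any particular Tensor Core and not the paper's model; the names `exponent`, `aligned_truncated_sum`, and `keep_bits` are hypothetical.

```python
import math
from fractions import Fraction


def exponent(t):
    """floor(log2(|t|)) for a nonzero dyadic rational (power-of-two denominator)."""
    return abs(t.numerator).bit_length() - t.denominator.bit_length()


def aligned_truncated_sum(terms, keep_bits):
    """Toy multi-term adder (hypothetical, for illustration only).

    Each term is aligned to the largest exponent among the terms, bits
    below 2**(emax - keep_bits) are truncated toward zero, and the
    truncated terms are then added exactly.
    """
    exact = [Fraction(t) for t in terms if t != 0]   # floats convert exactly
    if not exact:
        return Fraction(0)
    emax = max(exponent(t) for t in exact)
    ulp = Fraction(2) ** (emax - keep_bits)          # weight of the last kept bit
    total = 0
    for t in exact:
        total += math.trunc(t / ulp)                 # truncate after alignment
    return total * ulp


terms = [1.0, 2**-24, 2**-24]
print(float(sum(Fraction(t) for t in terms)))    # exact sum: 1 + 2**-23
print(float(aligned_truncated_sum(terms, 23)))   # small terms are lost
print(float(aligned_truncated_sum(terms, 24)))   # one extra bit keeps them
```

With 23 kept bits both small terms fall below the alignment threshold and vanish, whereas a single extra bit recovers the exact sum. Differences of this kind in width and truncation point are one way two multipliers can disagree on the same inputs.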
