Accurate Models of NVIDIA Tensor Cores
Faizan A. Khattak, Mantas Mikaitis

TL;DR
This paper develops software models that accurately emulate the numerical behavior of NVIDIA Tensor Cores across several GPU architectures and input formats, addressing numerical inconsistencies between hardware generations.
Contribution
It introduces detailed software models for low- and mixed-precision matrix multipliers on multiple NVIDIA GPU architectures, improving reproducibility and analysis.
Findings
Models replicate hardware rounding and normalization behaviors.
Models cover 8-, 16-, and 19-bit floating-point formats.
Models enable consistent testing across different GPU generations.
Abstract
Matrix multiplication is a fundamental operation in both the training of neural networks and inference. To accelerate it, Graphics Processing Units (GPUs) implement matrix multiplication in hardware. Because of their higher throughput compared with software-based matrix multiplication, these multipliers are increasingly used outside of AI to accelerate various applications in scientific computing. However, matrix multipliers targeted at AI are at present not compliant with IEEE 754 floating-point arithmetic behaviour, with different vendors offering different numerical features. This leads to results that are not reproducible across different generations of GPU architectures at the matrix multiply-accumulate instruction level. To study numerical characteristics of matrix multipliers, such as rounding behaviour, accumulator width, normalization points, extra carry bits, and others, test…
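To make the abstract's terms concrete, the following is a minimal Python sketch of how accumulator width and rounding mode can change the result of a single multiply-accumulate. It is a toy illustration, not the authors' model: the function name `block_dot`, the 24-bit base significand, and the `extra_bits` and `mode` parameters are all assumptions chosen for the example and do not describe any specific NVIDIA architecture.

```python
import math
from fractions import Fraction


def block_dot(a, b, c, extra_bits=2, mode="truncate"):
    """Toy model of d = c + sum(a[i] * b[i]) with a fixed-width accumulator.

    Hypothetical parameters (not taken from the paper):
      extra_bits: bits kept below an assumed 24-bit significand after alignment
      mode: "truncate" (round toward zero) or "nearest" (round half to even)
    """
    # Products are formed exactly, so only alignment and rounding lose bits.
    terms = [Fraction(x) * Fraction(y) for x, y in zip(a, b)] + [Fraction(c)]
    nonzero = [t for t in terms if t != 0]
    if not nonzero:
        return 0.0

    # Align every addend to the exponent of the largest-magnitude addend.
    emax = max(math.frexp(float(t))[1] for t in nonzero)
    # Weight of the least significant bit the accumulator keeps.
    ulp = Fraction(2) ** (emax - 24 - extra_bits)

    total = Fraction(0)
    for t in terms:
        q = t / ulp
        q = math.trunc(q) if mode == "truncate" else round(q)
        total += q * ulp

    # Converted to a Python float (binary64); rounding to a narrower
    # output format is deliberately left out of this sketch.
    return float(total)


a = [1.0, 2.0 ** -30]
b = [1.0, 1.0]
narrow = block_dot(a, b, 0.0, extra_bits=2)   # small addend is shifted out
wide = block_dot(a, b, 0.0, extra_bits=10)    # small addend survives
print(narrow == wide)  # False
```

With the same inputs, the narrow accumulator discards the small product during alignment while the wider one keeps it, which is the kind of instruction-level difference that produces non-reproducible results between hardware with different internal widths. Switching `mode` has a similar effect when an addend falls between two representable accumulator values.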
