Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance
Hiroyuki Ootomo, Rio Yokota

TL;DR
This paper presents a method that recovers full FP32 accuracy from matrix multiplication on NVIDIA Tensor Cores while exceeding the theoretical peak performance of the FP32 SIMT cores, with high throughput and low power consumption.
Contribution
The authors develop a high-accuracy, high-performance matrix multiplication method on Tensor Cores that matches FP32 accuracy and exceeds the theoretical peak throughput of FP32 cores.
Findings
Achieves 51 TFlop/s with FP16 Tensor Cores when the exponent range is limited
Achieves 33 TFlop/s with TF32 Tensor Cores over the full exponent range
Both results exceed the theoretical FP32 SIMT core peak performance of 19.5 TFlop/s
Abstract
The Tensor Core is a mixed-precision matrix-matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on the Ampere architecture. Tensor Cores were developed in response to the high demand for dense matrix multiplication from machine learning. However, many applications in scientific computing, such as preconditioners for iterative solvers and low-precision Fourier transforms, can also exploit Tensor Cores. To compute a matrix multiplication on Tensor Cores, we need to convert the input matrices to half precision, which results in a loss of accuracy. To avoid this, we can store the mantissa bits lost in the conversion in additional half-precision variables and use them to correct the accuracy of the matrix-matrix multiplication. Even with this correction, Tensor Cores yield higher throughput than FP32 SIMT Cores. Nevertheless, the correcting…
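The correction idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it emulates half-precision rounding on the CPU with Python's struct module, accumulates in ordinary Python floats rather than in a Tensor Core accumulator, and leaves out any scaling of the residual term. The function names (to_fp16, to_fp32, split_fp32, dot_fp16_inputs, dot_corrected) are hypothetical and chosen only for this illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def split_fp32(x):
    """Split x into a half-precision value and a half-precision residual.

    hi is x rounded to FP16; lo holds (approximately) the mantissa bits
    that the rounding discarded, so hi + lo is closer to x than hi alone.
    """
    hi = to_fp16(x)
    lo = to_fp16(x - hi)
    return hi, lo


def dot_fp16_inputs(a, b):
    """Dot product with both inputs rounded to FP16 and no correction."""
    return sum(to_fp16(x) * to_fp16(y) for x, y in zip(a, b))


def dot_corrected(a, b):
    """Dot product with first-order correction terms from the residuals.

    Uses x*y ~= xh*yh + xh*yl + xl*yh and drops the small xl*yl term.
    Accumulation in Python floats is an assumption of this sketch.
    """
    total = 0.0
    for x, y in zip(a, b):
        xh, xl = split_fp32(x)
        yh, yl = split_fp32(y)
        total += xh * yh + xh * yl + xl * yh
    return total
```

Comparing dot_fp16_inputs and dot_corrected against a reference product of the same FP32 inputs shows how much of the conversion error the residual terms recover. On a GPU, each of the three products would become a separate Tensor Core matrix multiplication, which is why the correction costs extra work.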
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Tensor decomposition and applications
