Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

Hiroyuki Ootomo; Rio Yokota

arXiv:2203.03341·cs.DC·October 19, 2023·Int. J. High Perform. Comput. Appl.

Open Access

TL;DR

This paper presents a method to recover full FP32 accuracy from matrix multiplication on NVIDIA Tensor Cores. The method exceeds the theoretical peak performance of the FP32 SIMT cores while keeping power consumption low.

Contribution

The authors develop a high-accuracy, high-performance matrix multiplication method on Tensor Cores that matches FP32 accuracy and exceeds the theoretical peak throughput of FP32 cores.

Findings

01

Achieves 51 TFlop/s with FP16 Tensor Cores for a limited exponent range

02

Achieves 33 TFlop/s with TF32 Tensor Cores for the full FP32 exponent range

03

Outperforms the theoretical FP32 SIMT core peak performance of 19.5 TFlop/s
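
To put the three figures above side by side, the short calculation below divides each reported throughput by the 19.5 TFlop/s FP32 SIMT peak. The variable names are illustrative only, and the ratios are plain arithmetic on the numbers listed in the findings, not additional measurements.

```python
# Figures taken from the findings above (TFlop/s).
FP32_SIMT_PEAK = 19.5
reported = {
    "FP16 Tensor Cores (limited exponent range)": 51.0,
    "TF32 Tensor Cores (full exponent range)": 33.0,
}

# Ratio of each reported throughput to the FP32 SIMT theoretical peak.
ratios = {name: round(tflops / FP32_SIMT_PEAK, 2) for name, tflops in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the FP32 SIMT peak")
```

Both ratios are above 1, which is what the third finding states.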

Abstract

Tensor Core is a mixed-precision matrix-matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on Ampere architectures. Tensor Cores were developed in response to the high demand for dense matrix multiplication from machine learning. However, many applications in scientific computing such as preconditioners for iterative solvers and low-precision Fourier transforms can exploit these Tensor Cores. To compute a matrix multiplication on Tensor Cores, we need to convert input matrices to half-precision, which results in loss of accuracy. To avoid this, we can keep the mantissa bits lost in the conversion in additional half-precision variables and use them for correcting the accuracy of matrix-matrix multiplication. Even with this correction, the use of Tensor Cores yields higher throughput compared to FP32 SIMT Cores. Nevertheless, the correcting…
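
The idea of keeping the lost mantissa bits in an extra half-precision variable can be illustrated on a scalar dot product. The sketch below is not the authors' implementation: it emulates FP16 and FP32 rounding with Python's `struct` module instead of running on Tensor Cores, and the helper names (`to_fp16`, `to_fp32`, `split_fp32`, `dot_fp16`, `dot_corrected`) are hypothetical. The factor `SCALE = 2**11` is an assumption made here to keep the small residual away from the FP16 subnormal range. The product of the two residuals is dropped as negligible.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Assumption for this sketch: scale the residual so it stays out of the
# FP16 subnormal range before it is rounded to half precision.
SCALE = 2.0 ** 11

def split_fp32(x: float) -> tuple[float, float]:
    """Split x into an FP16 value and an FP16 residual scaled by SCALE."""
    hi = to_fp16(x)
    lo = to_fp16((x - hi) * SCALE)
    return hi, lo

def dot_fp16(a, b) -> float:
    """Dot product with FP16 inputs and FP32 accumulation, no correction."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp32(acc + to_fp16(x) * to_fp16(y))
    return acc

def dot_corrected(a, b) -> float:
    """Dot product that adds back the first-order terms lost in conversion."""
    main = 0.0  # sum of hi * hi products
    corr = 0.0  # sum of hi * lo and lo * hi products, still scaled by SCALE
    for x, y in zip(a, b):
        xh, xl = split_fp32(x)
        yh, yl = split_fp32(y)
        main = to_fp32(main + xh * yh)
        corr = to_fp32(corr + xh * yl + xl * yh)
    return to_fp32(main + corr / SCALE)

# 1 + 2**-13 is exact in FP32 but rounds to 1.0 in FP16.
a = [to_fp32(1.0 + 2.0 ** -13)] * 8
b = [1.0] * 8
print(dot_fp16(a, b))       # prints 8.0
print(dot_corrected(a, b))  # prints 8.0009765625
```

In this example the exact result is 8 + 2**-10 = 8.0009765625. The plain FP16 conversion loses the low mantissa bits of every element of `a`, while the corrected version recovers them from the residual terms.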

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Tensor decomposition and applications