NVIDIA Tensor Core Programmability, Performance & Precision
Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, Jeffrey S. Vetter

TL;DR
This paper explores the programmability, performance, and precision aspects of NVIDIA Tensor Cores, demonstrating their high computational throughput and discussing the trade-offs in mixed precision calculations for HPC applications.
Contribution
The paper experimentally compares the three ways of programming Tensor Cores (the CUDA WMMA API, CUTLASS, and cuBLAS GEMM) and quantifies both their performance and the precision lost to mixed-precision computation.
Findings
Tensor Cores deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU.
This is significantly higher than the performance of conventional single- and half-precision computation.
Precision loss can be mitigated, at the cost of additional computation.
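The precision trade-off in the last finding can be illustrated with a small emulation. This is only a sketch under stated assumptions: it uses Python's struct module to round values to half and single precision rather than running on a GPU, and the residual-splitting scheme in dot_refined is an illustrative way of trading extra computation for accuracy, not necessarily the method used in the paper. All function names are hypothetical.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (binary16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (binary32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_mixed(a, b):
    """Dot product with half-precision inputs and single-precision accumulation."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_single(acc + to_half(x) * to_half(y))
    return acc


def dot_refined(a, b):
    """Same arithmetic, but each input also carries a half-precision residual.

    Assumption for illustration: x is approximated as x_hi + x_lo, so one
    product becomes three (the tiny x_lo * y_lo term is dropped).
    """
    acc = 0.0
    for x, y in zip(a, b):
        x_hi, y_hi = to_half(x), to_half(y)
        x_lo, y_lo = to_half(x - x_hi), to_half(y - y_hi)
        for product in (x_hi * y_hi, x_hi * y_lo, x_lo * y_hi):
            acc = to_single(acc + product)
    return acc


# 1/3 is not exactly representable in half precision, so every term is rounded.
n = 1000
a = [1.0 / 3.0] * n
b = [1.0 / 3.0] * n

reference = math.fsum(x * y for x, y in zip(a, b))  # double-precision baseline
err_mixed = abs(dot_mixed(a, b) - reference)
err_refined = abs(dot_refined(a, b) - reference)

print(f"mixed precision error:      {err_mixed:.3e}")
print(f"refined (3x products) error: {err_refined:.3e}")
```

In this sketch the refined version performs three products per element instead of one, which is the kind of extra cost the finding refers to; the error that remains comes mostly from single-precision accumulation.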
Abstract
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core", that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performance and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and…
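The 125 Tflops/s peak quoted in the abstract can be reproduced with a short back-of-the-envelope calculation. The sketch below takes the 640 Tensor Cores and the 4x4 matrix size from the abstract; the 1.53 GHz boost clock is an assumption not stated in the text, and the variable names are illustrative.

```python
TENSOR_CORES = 640        # from the abstract
MATRIX_DIM = 4            # 4x4 matrix-multiply-and-accumulate per clock cycle
BOOST_CLOCK_HZ = 1.53e9   # assumption: V100 boost clock, not given in the text

# D = A*B + C on 4x4 matrices needs 4*4*4 fused multiply-adds,
# and each fused multiply-add counts as two floating-point operations.
fma_per_clock = MATRIX_DIM ** 3
flops_per_clock = 2 * fma_per_clock

peak_tflops = TENSOR_CORES * flops_per_clock * BOOST_CLOCK_HZ / 1e12

# Fraction of the quoted 125 Tflops/s peak reached by the measured 83 Tflops/s.
measured_tflops = 83.0
efficiency = measured_tflops / 125.0

print(f"estimated peak: {peak_tflops:.1f} Tflops/s")
print(f"measured / quoted peak: {efficiency:.1%}")
```

Under that clock assumption the estimate lands close to the quoted peak, and the measured 83 Tflops/s corresponds to roughly two thirds of it.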
