Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library

Hiroyuki Ootomo; Rio Yokota

arXiv:2308.15152·cs.DC·August 30, 2023

TL;DR

This paper presents a WMMA API extension library that reduces the shared memory footprint of Tensor Core kernels, reaching 54.2 TFlop/s for SGEMM on an NVIDIA A100 GPU.

Contribution

The paper introduces a WMMA API extension library that reduces the shared memory footprint of matrix multiplication on Tensor Cores, so that shared memory bandwidth is less of a limit on Tensor Core throughput.

Findings

01. Reduces shared memory footprint significantly.

02. Achieves 54.2 TFlop/s on A100 GPU for SGEMM.

03. Outperforms theoretical peak of FP32 SIMT cores.

Abstract

The NVIDIA Tensor Core is a mixed-precision matrix multiply-and-add unit whose theoretical peak performance exceeds 300 TFlop/s on the NVIDIA A100 GPU. NVIDIA provides the WMMA API for using Tensor Cores in custom kernel functions. The most common way to use Tensor Cores is to supply the input matrices from shared memory, which has higher bandwidth than global memory. However, the Bytes-per-Flops (B/F) ratio between shared memory and Tensor Cores is small because Tensor Core throughput is so high. Thus, it is important to reduce the shared memory footprint to use Tensor Cores efficiently. In this paper, we analyze simple matrix-matrix multiplication on Tensor Cores with the roofline model and find that shared memory bandwidth can limit performance when using the WMMA API. To alleviate this issue, we provide a WMMA API extension…

Code & Models

Repositories

wmmae/wmma_extension (Official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Advanced Data Storage Technologies