Flexible Performant GEMM Kernels on GPUs

Thomas Faingnaert; Tim Besard; Bjorn De Sutter

arXiv:2009.12263·cs.MS·November 23, 2021

TL;DR

This paper introduces flexible, high-performance GEMM kernels for GPUs written in Julia. Researchers can extend the algorithms easily without sacrificing performance, so they get programming productivity, high performance, and research flexibility at once.

Contribution

The paper presents three sets of abstractions and interfaces for programming GEMMs in Julia. They achieve performance comparable to or better than existing libraries while remaining flexible and easy to extend.

Findings

01 Performance comparable to cuBLAS and CUTLASS

02 Achieves high flexibility for algorithm extension

03 No need for low-level CUDA programming

Abstract

General Matrix Multiplication or GEMM kernels take centre place in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA's Tensor Cores. Their exploitation is hampered by the two-language problem: it requires either low-level programming which implies low programmer productivity or using libraries that only offer a limited set of components. Because rephrasing algorithms in terms of established components often introduces overhead, the libraries' lack of flexibility limits the freedom to explore new algorithms. Researchers using GEMMs can hence not enjoy programming productivity, high performance, and research flexibility at once. In this paper we solve this problem. We present three sets of abstractions and interfaces to program GEMMs within the scientific Julia programming language. The interfaces and abstractions are…
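
To make the overhead argument concrete, here is a minimal CPU reference sketch in Python. It is not the paper's Julia API. It assumes the standard BLAS form of GEMM, D = alpha * A * B + beta * C. The `epilogue` parameter and the `relu` helper are hypothetical names introduced only for illustration. The sketch contrasts an elementwise operation fused into the kernel with the same operation applied as a separate pass over the result, which is what rephrasing an algorithm in terms of a fixed library component would typically require.

```python
from typing import Callable, List

Matrix = List[List[float]]


def gemm(a: Matrix, b: Matrix, c: Matrix,
         alpha: float = 1.0, beta: float = 1.0,
         epilogue: Callable[[float], float] = lambda x: x) -> Matrix:
    """Reference GEMM: D = epilogue(alpha * A @ B + beta * C).

    `epilogue` is a hypothetical hook for this sketch: an elementwise
    function applied to each output element before it is stored.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    if len(b) != k or len(c) != m or len(c[0]) != n:
        raise ValueError("incompatible matrix shapes")
    d = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Fused: the elementwise op runs while the value is still "in hand".
            d[i][j] = epilogue(alpha * acc + beta * c[i][j])
    return d


def relu(x: float) -> float:
    """Hypothetical elementwise operation used as the example epilogue."""
    return x if x > 0.0 else 0.0


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
c = [[-10.0, 0.0], [0.0, 0.0]]

# Fused variant: one pass over the output.
fused = gemm(a, b, c, epilogue=relu)

# Unfused variant: a plain GEMM followed by a second pass over the result.
unfused = [[relu(x) for x in row] for row in gemm(a, b, c)]
```

Both variants compute the same matrix. On a GPU the second pass would mean an extra kernel launch and an extra round trip through global memory, which is the kind of overhead the abstract refers to.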
