TL;DR
This paper introduces flexible, high-performance GEMM kernels for GPUs written in Julia. Researchers can extend the algorithms easily without sacrificing performance, so productivity, performance, and flexibility no longer have to be traded off against each other.
Contribution
The paper presents three sets of abstractions and interfaces for programming GEMMs in Julia. They achieve performance comparable to or better than existing libraries while being more flexible and easier to extend.
Findings
Performance comparable to cuBLAS and CUTLASS
Achieves high flexibility for algorithm extension
No need for low-level CUDA programming
Abstract
General Matrix Multiplication (GEMM) kernels take centre stage in high-performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA's Tensor Cores. Their exploitation is hampered by the two-language problem: it requires either low-level programming, which implies low programmer productivity, or the use of libraries that offer only a limited set of components. Because rephrasing algorithms in terms of established components often introduces overhead, the libraries' lack of flexibility limits the freedom to explore new algorithms. Researchers using GEMMs therefore cannot enjoy programming productivity, high performance, and research flexibility at once. In this paper we solve this problem. We present three sets of abstractions and interfaces to program GEMMs within the scientific Julia programming language. The interfaces and abstractions are…
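For orientation, the GEMM operation discussed in the abstract is conventionally defined (as in BLAS) as D = alpha * A * B + beta * C. The sketch below is a plain-Python illustration of that definition, computed tile by tile with a user-supplied elementwise step applied to each result. It is not the paper's Julia interface: the function name `gemm_tiled` and the `tile` and `epilogue` parameters are hypothetical, chosen only to show the kind of extension point (here, fusing an activation into the kernel) that a fixed library routine may not expose.

```python
# Illustrative sketch only: NOT the paper's Julia API and not GPU code.
# `gemm_tiled`, `tile`, and `epilogue` are hypothetical names for this example.

def gemm_tiled(A, B, C, alpha=1.0, beta=1.0, tile=2, epilogue=lambda x: x):
    """Return epilogue(alpha * A @ B + beta * C) for list-of-lists matrices."""
    m, k, n = len(A), len(B), len(B[0])
    D = [[0.0] * n for _ in range(m)]
    # Walk over the output in tile x tile blocks, as a blocked GEMM would.
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for i in range(i0, min(i0 + tile, m)):
                for j in range(j0, min(j0 + tile, n)):
                    acc = 0.0
                    # Accumulate along the shared K dimension, block by block.
                    for p0 in range(0, k, tile):
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                    # Scale, add C, then apply the fused elementwise step.
                    D[i][j] = epilogue(alpha * acc + beta * C[i][j])
    return D


def relu(x):
    """Example elementwise epilogue: clamp negative values to zero."""
    return x if x > 0 else 0.0
```

Calling `gemm_tiled(A, B, C, epilogue=relu)` applies the activation inside the same loop nest that produces each output element, rather than as a second pass over D. This is the sort of customisation the abstract says is costly to express using only established library components.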
