A Few Fit Most: Improving Performance Portability of SGEMM on GPUs using Multi-Versioning
Robert Hochgraf (1), Sreepathi Pai (2) ((1) Rochester Institute of Technology, (2) University of Rochester)

TL;DR
This paper presents a framework that uses multi-versioning to generate performance-portable GEMM kernels for GPUs, so the kernels do not need to be retuned when the device or execution environment changes.
Contribution
The authors introduce a portability tuning framework that automatically creates multi-versioned GPU kernels, achieving near-optimal performance without retuning or environment-specific optimization.
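As a rough illustration of what choosing "a few" kernel versions could look like, the sketch below greedily picks k variants from a table of measured performance so that the best chosen variant in each environment is as fast as possible on average. This is an assumption made for illustration, not necessarily the authors' selection method. The variant names, environment names, and numbers in PERF are invented, and portfolio_score and greedy_select are hypothetical helpers.

```python
from typing import Dict, List

# Hypothetical measurements: PERF[variant][environment] is the fraction of
# that environment's theoretical peak reached by the variant. Made-up values.
PERF: Dict[str, Dict[str, float]] = {
    "tile_32x32": {"gpu_a_small": 0.70, "gpu_a_large": 0.40, "gpu_b_large": 0.35},
    "tile_64x64": {"gpu_a_small": 0.30, "gpu_a_large": 0.80, "gpu_b_large": 0.60},
    "tile_128x64": {"gpu_a_small": 0.20, "gpu_a_large": 0.75, "gpu_b_large": 0.85},
}


def portfolio_score(chosen: List[str], perf: Dict[str, Dict[str, float]]) -> float:
    """Mean over environments of the best performance among the chosen variants."""
    envs = list(next(iter(perf.values())).keys())
    return sum(max(perf[v][e] for v in chosen) for e in envs) / len(envs)


def greedy_select(perf: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Greedily add the variant that raises the portfolio score the most."""
    chosen: List[str] = []
    for _ in range(min(k, len(perf))):
        remaining = [v for v in perf if v not in chosen]
        best = max(remaining, key=lambda v: portfolio_score(chosen + [v], perf))
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    picked = greedy_select(PERF, 2)
    print(picked, round(portfolio_score(picked, PERF), 3))
```

With this toy table, two variants already recover most of what all three achieve together, which is the intuition behind shipping a small set of versions instead of retuning per device.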
Findings
Outperforms CLBlast's default kernels by up to 10% of theoretical maximum performance
Generates code that generalizes well to unseen devices
Eliminates retuning by maintaining performance portability
Abstract
Hand-optimizing linear algebra kernels for different GPU devices and applications is complex and labor-intensive. Instead, many developers use automatic performance tuning (autotuning) to achieve high performance on a variety of devices. However, autotuning "overfits", and must be redone if any part of the environment changes, such as if the device or input characteristics change. In most non-trivial cases, a single compute kernel cannot maintain near-optimal performance across all environments. Changing the kernel to specialize it to the current execution environment is possible, but on GPUs, runtime tuning and compilation can be expensive. In this work, we use multi-versioning -- producing several variants of the same code -- as a way to generate performance portable code. We describe a framework called portability tuning that can automatically generate multi-versioned code whose…
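The multi-versioning idea described in the abstract can be sketched as a dispatcher that ships several variants of the same routine and picks one per call, with no recompilation. The sketch below is a CPU-only Python illustration under stated assumptions: real multi-versioned GPU code would consist of compiled kernels with different tuning parameters, and the pick_variant threshold is a made-up rule, not one taken from the paper.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]


def sgemm_rowwise(a: Matrix, b: Matrix) -> Matrix:
    """Variant 1: plain triple loop computing C = A * B."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip = a[i][p]
            for j in range(m):
                c[i][j] += aip * b[p][j]
    return c


def sgemm_blocked(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Variant 2: same result, but iterating over tile x tile blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        aip = a[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += aip * b[p][j]
    return c


# All versions are built ahead of time; dispatch only chooses among them.
VARIANTS: Dict[str, Callable[[Matrix, Matrix], Matrix]] = {
    "rowwise": sgemm_rowwise,
    "blocked": sgemm_blocked,
}


def pick_variant(n: int, m: int, k: int) -> str:
    """Hypothetical selection rule: use the blocked variant for larger problems."""
    return "blocked" if n * m * k >= 64 else "rowwise"


def sgemm(a: Matrix, b: Matrix) -> Matrix:
    """Dispatch to one of the prebuilt variants based on the input shape."""
    name = pick_variant(len(a), len(b[0]), len(b))
    return VARIANTS[name](a, b)


if __name__ == "__main__":
    print(sgemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because every variant computes the same product, the choice affects only speed, which is what allows the selection rule to change per device or input size without touching correctness.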
