A Few Fit Most: Improving Performance Portability of SGEMM on GPUs using Multi-Versioning

Robert Hochgraf (1); Sreepathi Pai (2) ((1) Rochester Institute of Technology; (2) University of Rochester)

arXiv:2507.15277·cs.PL·July 22, 2025



TL;DR

This paper presents a framework that uses multi-versioning to generate performance portable GEMM kernels for GPUs, eliminating the need for retuning across diverse devices and environments.

Contribution

The authors introduce a portability tuning framework that automatically creates multi-versioned GPU kernels, achieving near-optimal performance without retuning or environment-specific optimization.

Findings

1. Outperforms CLBlast's default kernels by up to 10% of theoretical maximum performance

2. Generates code that generalizes well to unseen devices

3. Eliminates retuning by maintaining performance portability

Abstract

Hand-optimizing linear algebra kernels for different GPU devices and applications is complex and labor-intensive. Instead, many developers use automatic performance tuning (autotuning) to achieve high performance on a variety of devices. However, autotuning "overfits", and must be redone if any part of the environment changes, such as if the device or input characteristics change. In most non-trivial cases, a single compute kernel cannot maintain near-optimal performance across all environments. Changing the kernel to specialize it to the current execution environment is possible, but on GPUs, runtime tuning and compilation can be expensive. In this work, we use multi-versioning -- producing several variants of the same code -- as a way to generate performance portable code. We describe a framework called portability tuning that can automatically generate multi-versioned code whose…
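
To make the multi-versioning idea concrete, here is a minimal sketch of how a small set of kernel variants might be chosen so that "a few fit most" environments. It is not the authors' portability tuning algorithm, which the abstract does not detail: the greedy selection rule, the names `PERF`, `coverage`, `select_variants`, and `dispatch`, and all the numbers are hypothetical and for illustration only.

```python
from typing import Dict, List

# Hypothetical measurements (made up for illustration):
# PERF[environment][variant] = fraction of theoretical peak achieved.
PERF: Dict[str, Dict[str, float]] = {
    "gpu_a/small": {"v0": 0.40, "v1": 0.70, "v2": 0.30},
    "gpu_a/large": {"v0": 0.80, "v1": 0.50, "v2": 0.60},
    "gpu_b/small": {"v0": 0.35, "v1": 0.65, "v2": 0.45},
    "gpu_b/large": {"v0": 0.55, "v1": 0.40, "v2": 0.85},
}


def coverage(perf: Dict[str, Dict[str, float]], chosen: List[str]) -> float:
    """Mean, over environments, of the best performance among chosen variants."""
    if not chosen:
        return 0.0
    best_per_env = [max(row[v] for v in chosen) for row in perf.values()]
    return sum(best_per_env) / len(best_per_env)


def select_variants(perf: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Greedily pick up to k variants, each time adding the one that raises
    coverage the most. Greedy selection is an assumption of this sketch."""
    variants = sorted(next(iter(perf.values())))
    chosen: List[str] = []
    for _ in range(min(k, len(variants))):
        remaining = [v for v in variants if v not in chosen]
        best = max(remaining, key=lambda v: coverage(perf, chosen + [v]))
        chosen.append(best)
    return chosen


def dispatch(perf: Dict[str, Dict[str, float]], chosen: List[str], env: str) -> str:
    """Pick the shipped variant that performs best in a known environment."""
    return max(chosen, key=lambda v: perf[env][v])


if __name__ == "__main__":
    shipped = select_variants(PERF, k=2)
    print("shipped variants:", shipped)
    print("coverage:", coverage(PERF, shipped))
    print("gpu_b/large runs:", dispatch(PERF, shipped, "gpu_b/large"))
```

In this toy data no single variant is best everywhere, but two variants together recover most of the performance available from all three. Note that `dispatch` looks up measured results, so it only works for environments already in the table; choosing a variant on an unseen device would need some other selection rule, which this sketch does not attempt.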
