Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision
Evelyne Ringoot, Rabab Alomairy, Valentin Churavy, Alan Edelman

TL;DR
This paper introduces a portable, GPU-accelerated SVD implementation in Julia that supports diverse hardware and data types, including Apple Metal GPUs and half precision, achieving high performance across platforms.
Contribution
It presents what is, to the authors' knowledge, the first GPU-accelerated SVD supporting Apple Metal GPUs and half precision, with a unified, hardware-agnostic implementation in Julia.
Findings
Outperforms most linear algebra libraries for large matrices
Supports diverse GPU architectures and data types
Achieves 80%-90% of cuSOLVER performance on large matrices
Abstract
This paper presents a portable, GPU-accelerated implementation of a QR-based singular value computation algorithm in Julia. The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing and machine learning, providing optimal low-rank matrix approximations. Its importance has grown further in large-scale machine learning pipelines, including large language models (LLMs), where it enables low-rank adaptation (LoRA). The implemented algorithm is based on the classic two-stage QR reduction, which successively reduces the matrix to band form and then to bidiagonal form. Our implementation leverages Julia's multiple dispatch and metaprogramming capabilities, integrating with the GPUArrays and KernelAbstractions frameworks to provide a unified type- and hardware-agnostic function. It supports diverse GPU architectures and data types, and is, to our knowledge,…
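The reduction to bidiagonal form mentioned in the abstract can be illustrated with a small sketch. This is not the paper's method: it is a simplified one-stage Householder (Golub-Kahan style) bidiagonalization on a dense matrix, written in pure Python on the CPU, whereas the paper describes a two-stage reduction (band form first, then bidiagonal form) implemented as GPU kernels in Julia. The function names `householder`, `bidiagonalize`, and `frobenius` are hypothetical and chosen only for illustration. The point it shows is that orthogonal reflections applied from the left and right leave the singular values unchanged while driving the matrix to a form with only a diagonal and a superdiagonal.

```python
import math


def householder(x):
    """Return a unit vector v with (I - 2 v v^T) x = alpha * e1, or None if x is zero."""
    norm_x = math.sqrt(sum(t * t for t in x))
    if norm_x == 0.0:
        return None
    # Choose the sign of alpha opposite to x[0] to avoid cancellation.
    alpha = -math.copysign(norm_x, x[0])
    v = list(x)
    v[0] -= alpha
    norm_v = math.sqrt(sum(t * t for t in v))
    return [t / norm_v for t in v]


def bidiagonalize(a):
    """Reduce an m x n matrix (m >= n, list of rows) to upper bidiagonal form.

    Only the bidiagonal matrix is returned; the orthogonal factors are not
    accumulated, which is enough when only singular values are needed.
    """
    b = [row[:] for row in a]
    m, n = len(b), len(b[0])
    for k in range(n):
        # Left reflector: zero out column k below the diagonal.
        v = householder([b[i][k] for i in range(k, m)])
        if v is not None:
            for j in range(k, n):
                dot = sum(v[i - k] * b[i][j] for i in range(k, m))
                for i in range(k, m):
                    b[i][j] -= 2.0 * dot * v[i - k]
        # Right reflector: zero out row k to the right of the superdiagonal.
        if k < n - 2:
            v = householder(b[k][k + 1:])
            if v is not None:
                for i in range(k, m):
                    dot = sum(v[j - k - 1] * b[i][j] for j in range(k + 1, n))
                    for j in range(k + 1, n):
                        b[i][j] -= 2.0 * dot * v[j - k - 1]
    return b


def frobenius(a):
    """Frobenius norm, which equals the root of the sum of squared singular values."""
    return math.sqrt(sum(t * t for row in a for t in row))


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
    b = bidiagonalize(a)
    for row in b:
        print(["%.4f" % t for t in row])
    print("norms:", frobenius(a), frobenius(b))
```

Because each reflector is orthogonal, the Frobenius norm of the input and the output agree up to rounding, and for a square input the product of the absolute diagonal entries equals the absolute determinant. A two-stage scheme, as described in the abstract, would instead first reduce to a band matrix and then chase the band down to bidiagonal form.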
