Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement

Junjie Li

arXiv:2501.00279·cs.DC·August 14, 2025

TL;DR

This paper presents SCILIB-Accel, a tool that automatically offloads all level-3 BLAS operations to GPUs using unified memory and a First-Use data movement policy, achieving significant speedups without code modifications.

Contribution

The paper extends automatic BLAS offloading to all level-3 operations on unified memory architectures and introduces a data movement policy, First-Use, inspired by OpenMP First-Touch.
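One way to read the first-use idea, by analogy with OpenMP first-touch page placement, is that a matrix is migrated to GPU-local memory the first time an offloaded BLAS call uses it and then stays there for later calls. The toy Python model below only counts such migrations. It is a sketch under that assumption, not SCILIB-Accel's implementation, and every name in it (`FirstUseTracker`, `touch`, `offloaded_gemm`) is hypothetical.

```python
# Toy model of a "first-use" residency policy.
# Assumption: a buffer is migrated once, on its first use by an offloaded call,
# and is treated as device-resident afterwards. All names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class FirstUseTracker:
    """Tracks which buffers have already been migrated to device memory."""
    resident: set = field(default_factory=set)
    migrations: int = 0

    def touch(self, buffer_id: str) -> bool:
        """Record a use; return True if this use triggered a migration."""
        if buffer_id in self.resident:
            return False  # already resident, no data movement needed
        self.resident.add(buffer_id)
        self.migrations += 1
        return True


def offloaded_gemm(tracker: FirstUseTracker, a: str, b: str, c: str) -> list:
    """Stand-in for an intercepted level-3 BLAS call.

    No arithmetic is performed; the function only reports which operands
    would be migrated by this call under the assumed policy.
    """
    return [name for name in (a, b, c) if tracker.touch(name)]


tracker = FirstUseTracker()
first = offloaded_gemm(tracker, "A", "B", "C")   # all three operands are new
second = offloaded_gemm(tracker, "A", "B", "D")  # only "D" is new
```

In this model, repeated calls that reuse the same matrices incur no further migrations, which is the reuse pattern such a policy would be designed to exploit.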

Findings

1. Achieved up to 3x speedup on quantum physics codes.
2. Demonstrated effectiveness on large-scale GPU clusters.
3. No code modifications required for offloading.

Abstract

BLAS is a fundamental building block of advanced linear algebra libraries and many modern scientific computing applications. GPUs are known for their strong arithmetic computing capabilities and are highly suited for BLAS operations. However, porting code to GPUs often requires significant effort, especially for large, complex codes or legacy codes, even for BLAS-heavy applications. While various tools exist to automatically offload BLAS to GPUs, they are often impractical due to the high costs associated with mandatory data transfers. The advent of unified memory architectures in recent GPU designs, such as the NVIDIA Grace-Hopper, allows cache-coherent memory access across all types of memory for both CPU and GPU, potentially eliminating the bottlenecks faced in conventional architectures. This breakthrough paves the way for innovative application developments and porting strategies.…

Taxonomy

Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Advanced Data Storage Technologies