Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement
Junjie Li

TL;DR
This paper presents SCILIB-Accel, a tool that automatically offloads all level-3 BLAS operations to GPUs using unified memory and a First-Use data movement policy, achieving significant speedups without code modifications.
Contribution
It extends automatic BLAS offloading to all level-3 operations on unified memory architectures and introduces a novel data movement policy inspired by OpenMP First-Touch.
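As a rough illustration of the idea behind a first-use style policy, the toy model below migrates a buffer the first time an offloaded BLAS call touches it and then reuses it on later calls. It is only a sketch under stated assumptions, not SCILIB-Accel's implementation: the class `FirstUseTracker`, its `gemm` method, and the use of `id()` as a buffer identity are all hypothetical, and the "migration" is just a counter rather than a real page move between CPU and GPU memory.

```python
class FirstUseTracker:
    """Toy model of a first-use data movement policy (hypothetical names)."""

    def __init__(self):
        self.resident = set()   # identities of buffers already "migrated"
        self.migrations = 0     # how many one-time moves were performed

    def _touch(self, buf):
        # Move a buffer only on its first use by an offloaded call.
        if id(buf) not in self.resident:
            self.resident.add(id(buf))
            self.migrations += 1

    def gemm(self, alpha, a, b, beta, c):
        # Stand-in for an intercepted level-3 BLAS call: C = alpha*A*B + beta*C.
        for buf in (a, b, c):
            self._touch(buf)
        n, k, m = len(a), len(b), len(b[0])
        for i in range(n):
            for j in range(m):
                acc = sum(a[i][p] * b[p][j] for p in range(k))
                c[i][j] = alpha * acc + beta * c[i][j]
        return c


# Two calls reuse the same three matrices, so only three moves are counted.
tracker = FirstUseTracker()
a = [[1.0, 1.0], [1.0, 1.0]]
b = [[1.0, 1.0], [1.0, 1.0]]
c = [[0.0, 0.0], [0.0, 0.0]]
tracker.gemm(1.0, a, b, 0.0, c)
tracker.gemm(1.0, a, b, 1.0, c)
```

In this model the cost of moving data is paid once per matrix and amortized over every later call that reuses it, which is the property that makes repeated BLAS calls on the same operands cheap to offload.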
Findings
Achieved up to 3x speedup on quantum physics codes.
Demonstrated effectiveness on large-scale GPU clusters.
No code modifications required for offloading.
Abstract
BLAS is a fundamental building block of advanced linear algebra libraries and many modern scientific computing applications. GPUs are known for their strong arithmetic computing capabilities and are highly suited for BLAS operations. However, porting code to GPUs often requires significant effort, especially for large, complex codes or legacy codes, even for BLAS-heavy applications. While various tools exist to automatically offload BLAS to GPUs, they are often impractical due to the high costs associated with mandatory data transfers. The advent of unified memory architectures in recent GPU designs, such as the NVIDIA Grace-Hopper, allows cache-coherent memory access across all types of memory for both CPU and GPU, potentially eliminating the bottlenecks faced in conventional architectures. This breakthrough paves the way for innovative application developments and porting strategies.…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Advanced Data Storage Technologies
