Implementing implicit OpenMP data sharing on GPUs
Gheorghe-Teodor Bercea, Carlo Bertolli, Arpith C. Jacob, Alexandre Eichenberger, Alexey Bataev, Georgios Rokos, Hyojin Sung, Tong Chen, Kevin O'Brien

TL;DR
This paper redesigns OpenMP's data sharing on GPUs by mapping implicitly shared local variables to shared memory, improving memory management and compatibility with CUDA semantics.
Contribution
It introduces a new data sharing infrastructure in Clang/LLVM that maps implicit local variable sharing to GPU shared memory, addressing CUDA-OpenMP semantic differences.
Findings
Low shared memory usage for scalar variables (under 26%)
No negative impact on GPU occupancy
Effective control over implicit shared memory allocation
Abstract
OpenMP is a shared memory programming model which supports the offloading of target regions to accelerators such as NVIDIA GPUs. The implementation in Clang/LLVM aims to deliver a generic GPU compilation toolchain that supports both the native CUDA C/C++ and the OpenMP device offloading models. There are situations where the semantics of OpenMP and those of CUDA diverge. One such example is the policy for implicitly handling local variables. In CUDA, local variables are implicitly mapped to thread local memory and thus become private to a CUDA thread. In OpenMP, due to semantics that allow the nesting of regions executed by different numbers of threads, variables need to be implicitly shared among the threads of a contention group. In this paper we introduce a re-design of the OpenMP device data sharing infrastructure that is responsible for the implicit sharing of local…
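To make the semantic gap concrete, here is a minimal sketch of the kind of code where implicit sharing arises. It is an illustration, not an example from the paper: the function name `scaled_sum` and the variable `scale` are hypothetical. The local variable `scale` is declared in the sequential part of a target region and then read inside a nested parallel region, so under OpenMP rules every thread of the parallel region must see the same `scale`. A CUDA-style mapping to thread-private local memory would not give that behaviour by default.

```c
#include <assert.h>

/* Hypothetical illustration: sums scale * in[i] on the target device
   (or on the host if no device or OpenMP support is available). */
int scaled_sum(const int *in, int n) {
    assert(n >= 0);
    int total = 0;
    #pragma omp target map(to: in[0:n]) map(tofrom: total)
    {
        /* Declared by the single thread that starts the target region. */
        int scale = 2;

        /* Nested parallel region: `scale` is not listed in any clause,
           so it is implicitly shared by all threads of the region. */
        #pragma omp parallel for reduction(+: total)
        for (int i = 0; i < n; ++i) {
            total += scale * in[i];
        }
    }
    return total;
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs sequentially with the same result, which makes the sketch easy to check on a host machine. Where a given compiler actually places `scale` (shared memory, global memory, or a register after optimisation) is an implementation decision and is not determined by this source code.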
