Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU
Hancheng Wu, Da Li, Michela Becchi

TL;DR
This paper introduces compiler techniques that optimize dynamic parallelism on GPUs, reducing runtime overhead and improving performance for parallel algorithms with irregular or recursive structures.
Contribution
It proposes three workload consolidation schemes implemented in a directive-based compiler to enhance GPU utilization for dynamic parallelism applications.
Findings
Achieved up to 3300x speedup over naive dynamic parallelism (DP) solutions
Reduced runtime overhead of DP-based codes
Improved GPU utilization for irregular and recursive algorithms
Abstract
GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset. However, the effective use of GPUs for algorithms exhibiting complex patterns of parallelism, possibly known only at runtime, is still an open problem. Recently, Nvidia has introduced Dynamic Parallelism (DP) in its GPUs. By making it possible to launch kernels directly from GPU threads, this feature enables nested parallelism at runtime. However, the effective use of DP must still be understood: a naive use of this feature may suffer from significant runtime overhead and lead to GPU underutilization, resulting in poor performance. In this work, we target this problem. First, we demonstrate how a naive use of DP can result in poor performance. Second,…
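To make the overhead argument concrete, the following is a minimal host-side sketch in Python, not GPU code and not the paper's compiler output. It models a naive DP pattern, where every parent thread with nested work issues its own child launch, against a consolidated pattern, where the work of a group of parent threads is gathered into one buffer and handled by a single child launch. The function names, the `group_size` parameter, and the sample workload are all hypothetical and chosen only for illustration; the sketch counts launches and does not model timing.

```python
def naive_launches(child_work):
    """Count child launches when each parent thread launches its own child.

    child_work[p] is the number of nested work items found by parent thread p.
    """
    return sum(1 for items in child_work if items > 0)


def consolidate(child_work, group_size):
    """Gather nested work from each group of parent threads into one buffer.

    Returns one buffer per group that has any work. Each buffer holds
    (parent_id, item_index) pairs, so a single child launch per buffer
    could process all of the group's work.
    """
    buffers = []
    for start in range(0, len(child_work), group_size):
        end = min(start + group_size, len(child_work))
        buffer = [
            (parent, item)
            for parent in range(start, end)
            for item in range(child_work[parent])
        ]
        if buffer:  # groups with no nested work launch nothing
            buffers.append(buffer)
    return buffers


# Hypothetical irregular workload: 8 parent threads, uneven nested work.
work = [0, 3, 1, 0, 0, 0, 5, 2]

print(naive_launches(work))        # 4 child launches
print(len(consolidate(work, 4)))   # 2 child launches
print(len(consolidate(work, 8)))   # 1 child launch
```

In this model the total number of work items is unchanged by consolidation (11 in every case); only the number of launches drops as the group grows. The trade-off between group size and the cost of filling the shared buffer is not captured here.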
