TL;DR
This paper introduces a compiler framework with three key optimizations—thresholding, coarsening, and aggregation—to improve the performance of dynamic parallelism on GPUs, especially for irregular nested workloads.
Contribution
It presents a compiler framework that optimizes dynamic parallelism on GPUs through thresholding, coarsening, and aggregation, reducing the launch-latency and hardware-underutilization penalties incurred when many small grids are launched.
Findings
Achieves a 43.0x geometric mean speedup over dynamic parallelism without the proposed optimizations.
Attains an 8.7x speedup over applications without dynamic parallelism.
Provides a 3.6x improvement over prior dynamic parallelism aggregation methods.
Abstract
Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior works have shown that dynamic parallelism may impose a high performance penalty when a large number of small grids are launched. The large number of launches results in high launch latency due to congestion, and the small grid sizes result in hardware underutilization. To address this issue, we propose a compiler framework for optimizing the use of dynamic parallelism in applications with nested parallelism. The framework features three key optimizations: thresholding, coarsening, and aggregation. Thresholding involves launching a grid dynamically only if the number of child threads exceeds some threshold, and serializing the child…
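The thresholding rule described in the abstract can be illustrated with a small CPU-side model. This is a sketch for intuition only, not the paper's compiler transformation and not GPU code: the names `LaunchPlan` and `plan_with_threshold`, the sample child counts, and the threshold value of 32 are all hypothetical. It counts how many child grids a set of parent threads would launch, and how many child work items would instead be serialized inside the parent, for a given threshold.

```python
from dataclasses import dataclass


@dataclass
class LaunchPlan:
    launches: int          # child grids that would be launched dynamically
    serialized_items: int  # child work items run serially inside parent threads


def plan_with_threshold(child_counts, threshold):
    """Model the thresholding decision for each parent thread.

    child_counts[i] is the number of child threads parent i would need.
    A parent launches a child grid only if that number exceeds the
    threshold; otherwise it processes the child work itself, serially.
    """
    launches = 0
    serialized_items = 0
    for count in child_counts:
        if count > threshold:
            launches += 1              # large enough to justify a launch
        else:
            serialized_items += count  # small: the parent does the work inline
    return LaunchPlan(launches, serialized_items)


if __name__ == "__main__":
    # Hypothetical irregular workload: most parents have very little nested work.
    child_counts = [3, 900, 0, 17, 2048, 5, 64, 1]
    print(plan_with_threshold(child_counts, threshold=0))   # launch whenever there is any work
    print(plan_with_threshold(child_counts, threshold=32))  # launch only for large child grids
```

In this made-up workload, raising the threshold from 0 to 32 cuts the number of launches from 7 to 3 at the cost of 26 serialized work items, which is the trade-off thresholding is meant to exploit. The model says nothing about actual launch latency or the best threshold value on real hardware.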
