Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs
Jonah Ekelund, Stefano Markidis, Ivy Peng

TL;DR
This paper introduces a kernel batching strategy using CUDA Graphs to optimize iterative GPU applications, significantly reducing kernel launch overhead and achieving over 1.4x speed-up in various benchmarks.
Contribution
It proposes a novel method to batch iterative kernel launches into CUDA Graphs, optimizing performance and providing a generalized approach for iterative solvers.
Findings
Optimal batch size balances overhead and performance gain.
Over 1.4x speed-up achieved in benchmarks.
Applicable to various iterative applications and solvers.
Abstract
Graphics Processing Units (GPUs) have become the standard in accelerating scientific applications on heterogeneous systems. However, as GPUs are getting faster, one potential performance bottleneck with GPU-accelerated applications is the overhead from launching several fine-grained kernels. CUDA Graph addresses these performance challenges by enabling a graph-based execution model that captures operations as nodes and dependencies as edges in a static graph, thereby consolidating several kernel launches into one graph launch. We propose a performance optimization strategy for iteratively launched kernels. By grouping kernel launches into iteration batches and then unrolling these batches into a CUDA Graph, iterative applications can benefit from CUDA Graph for performance boosting. We analyze the performance gain and overhead from this approach by designing a skeleton application. The…
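The batching idea in the abstract can be illustrated with a minimal sketch. It assumes the graph is built with stream capture, which the abstract does not specify, and it is not the authors' implementation. The kernel `step_kernel`, the helper `run_batched`, the array size, and the handling of leftover iterations with ordinary launches are all hypothetical choices made for illustration. `cudaGraphInstantiateWithFlags` requires CUDA 11.4 or later.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical fine-grained kernel standing in for one solver step: x[i] += 1.
__global__ void step_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Abort with a message on any CUDA runtime error.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Runs total_iters iterations of step_kernel, batch_size iterations per
// graph launch, and returns x[0] so the caller can check the result.
float run_batched(int total_iters, int batch_size) {
    const int n = 1 << 16;                       // assumed problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    float *d_x = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CUDA_CHECK(cudaMemsetAsync(d_x, 0, n * sizeof(float), stream));

    // Capture phase: unroll one batch of iterations into a single graph.
    // Kernels issued during capture are recorded, not executed.
    cudaGraph_t graph;
    CUDA_CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int i = 0; i < batch_size; ++i)
        step_kernel<<<blocks, threads, 0, stream>>>(d_x, n);
    CUDA_CHECK(cudaStreamEndCapture(stream, &graph));

    // Instantiation is a one-time cost that the batch launches amortize.
    cudaGraphExec_t graph_exec;
    CUDA_CHECK(cudaGraphInstantiateWithFlags(&graph_exec, graph, 0));

    // One graph launch now replaces batch_size individual kernel launches.
    for (int b = 0; b < total_iters / batch_size; ++b)
        CUDA_CHECK(cudaGraphLaunch(graph_exec, stream));

    // Assumption: iterations that do not fill a batch use plain launches.
    for (int i = 0; i < total_iters % batch_size; ++i)
        step_kernel<<<blocks, threads, 0, stream>>>(d_x, n);
    CUDA_CHECK(cudaGetLastError());

    CUDA_CHECK(cudaStreamSynchronize(stream));
    float x0 = 0.0f;
    CUDA_CHECK(cudaMemcpy(&x0, d_x, sizeof(float), cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaGraphExecDestroy(graph_exec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return x0;
}
```

Calling `run_batched(1000, 100)` issues 10 graph launches instead of 1000 kernel launches. A larger `batch_size` means fewer launches but a larger graph to capture and instantiate, which is the trade-off behind the optimal batch size noted in the findings.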
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques · Recommender Systems and Techniques
