Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

Jonah Ekelund; Stefano Markidis; Ivy Peng

arXiv:2501.09398·cs.DC·May 1, 2025

Open Access

TL;DR

This paper introduces a kernel batching strategy using CUDA Graphs to optimize iterative GPU applications, significantly reducing kernel launch overhead and achieving over 1.4x speed-up in various benchmarks.

Contribution

It proposes a novel method to batch iterative kernel launches into CUDA Graphs, optimizing performance and providing a generalized approach for iterative solvers.

Findings

01. Optimal batch size balances overhead and performance gain.

02. Over 1.4x speed-up achieved in benchmarks.

03. Applicable to various iterative applications and solvers.

Abstract

Graphics Processing Units (GPUs) have become the standard for accelerating scientific applications on heterogeneous systems. However, as GPUs get faster, one potential performance bottleneck in GPU-accelerated applications is the overhead of launching many fine-grained kernels. CUDA Graph addresses this challenge by enabling a graph-based execution model that captures operations as nodes and dependencies as edges in a static graph, thereby consolidating several kernel launches into one graph launch. We propose a performance optimization strategy for iteratively launched kernels. By grouping kernel launches into iteration batches and then unrolling these batches into a CUDA Graph, iterative applications can benefit from the performance gains of CUDA Graph. We analyze the performance gain and overhead of this approach by designing a skeleton application. The…
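The batching idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the kernel `step`, the helper `runBatched`, the `CUDA_CHECK` macro, and the use of stream capture to build the graph are all assumptions made for illustration. The sketch unrolls `batchSize` iterations of one kernel into a single CUDA Graph, launches that graph once per batch, and runs any leftover iterations as ordinary kernel launches.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical fine-grained kernel: one cheap update per iteration.
__global__ void step(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Runs totalIters iterations of `step`, batchSize iterations per graph launch.
// Assumes batchSize >= 1. Returns x[0] after all iterations; because x starts
// at zero and each iteration adds 1, this equals totalIters.
float runBatched(int n, int totalIters, int batchSize) {
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    float* d_x = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CUDA_CHECK(cudaMemsetAsync(d_x, 0, n * sizeof(float), stream));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Capture one batch: batchSize kernel launches unrolled into one graph.
    // Nothing executes during capture; the launches are only recorded.
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    CUDA_CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int i = 0; i < batchSize; ++i) {
        step<<<blocks, threads, 0, stream>>>(d_x, n);
    }
    CUDA_CHECK(cudaStreamEndCapture(stream, &graph));
    CUDA_CHECK(cudaGraphInstantiateWithFlags(&graphExec, graph, 0));

    // One graph launch replaces batchSize individual kernel launches.
    for (int b = 0; b < totalIters / batchSize; ++b) {
        CUDA_CHECK(cudaGraphLaunch(graphExec, stream));
    }
    // Iterations that do not fill a whole batch run as plain launches.
    for (int i = 0; i < totalIters % batchSize; ++i) {
        step<<<blocks, threads, 0, stream>>>(d_x, n);
    }
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaStreamSynchronize(stream));

    float first = 0.0f;
    CUDA_CHECK(cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaGraphExecDestroy(graphExec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(d_x));
    return first;
}
```

Calling `runBatched(1024, 105, 10)` from a host `main` would issue ten graph launches plus five plain kernel launches. In this sketch, a larger `batchSize` means fewer launches per run but a longer capture and instantiation step, which is the kind of trade-off the findings describe when they say the optimal batch size balances overhead against performance gain.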

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques · Recommender Systems and Techniques