Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs
Cole Ramos, Keith Lowery

TL;DR
Kerncap automates the extraction and isolation of GPU kernels from AMD applications, so a kernel can be edited, recompiled, and validated without rebuilding the full application. The authors report a 13.6x speedup in the kernel iteration workflow.
Contribution
Kerncap introduces an automated tool that captures, isolates, and reproduces GPU kernels from AMD applications, integrating with HIP and Triton for efficient kernel tuning.
Findings
Extracted kernels from snapshots as large as 30 GB.
Achieved 13.6x speedup in kernel iteration workflow.
Validated kernels across multiple AMD GPU architectures and domains.
Abstract
Iterative GPU kernel tuning is bottlenecked by the scale of the applications that host the kernels. Rapid iteration requires isolating the kernel so it can be edited, recompiled, and validated without rebuilding the full application -- but manual isolation requires reconstructing build flags, dispatch configuration, and runtime inputs by hand, so developers usually settle for slow in-place edits. We present Kerncap, an automated kernel extraction tool that intercepts dispatches at the HSA runtime for both HIP and Triton, bridging Triton's JIT-only metadata into HSA-level capture via a lightweight Python compile-hook shim. Kerncap performs an address-space closure of all device memory -- a virtual-address-faithful snapshot that preserves embedded device pointers without DWARF metadata or pointer chasing -- locates kernel sources, and emits self-contained reproducer projects. HIP…
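The idea behind a virtual-address-faithful snapshot can be illustrated with a small toy model. The sketch below is not Kerncap's implementation and uses no HSA or HIP calls: the ToyDeviceMemory class, the helper functions, and the addresses are all hypothetical, chosen only to show why restoring every region at its original virtual address keeps embedded pointers valid without inspecting buffer contents, while restoring at shifted addresses leaves them dangling.

```python
import struct

class ToyDeviceMemory:
    """Hypothetical stand-in for a device address space: base address -> bytes."""

    def __init__(self):
        self.regions = {}

    def map_region(self, base, data):
        self.regions[base] = bytearray(data)

    def read(self, addr, size):
        # Find the mapped region that fully contains [addr, addr + size).
        for base, buf in self.regions.items():
            if base <= addr and addr + size <= base + len(buf):
                off = addr - base
                return bytes(buf[off:off + size])
        raise KeyError(f"unmapped address {addr:#x}")

def snapshot(mem):
    """Record every region together with its base address (the 'closure')."""
    return [(base, bytes(buf)) for base, buf in sorted(mem.regions.items())]

def restore(snap, rebase=0):
    """Rebuild an address space; rebase=0 keeps the original virtual addresses."""
    mem = ToyDeviceMemory()
    for base, data in snap:
        mem.map_region(base + rebase, data)
    return mem

def deref_payload(mem, args_addr):
    """Follow the 8-byte little-endian pointer stored at args_addr, read 4 bytes."""
    (ptr,) = struct.unpack("<Q", mem.read(args_addr, 8))
    return mem.read(ptr, 4)

# Illustrative addresses: an argument struct that embeds a pointer to a data buffer.
DATA_BASE = 0x7F0000002000
ARGS_BASE = 0x7F0000001000

live = ToyDeviceMemory()
live.map_region(DATA_BASE, b"\x01\x02\x03\x04")
live.map_region(ARGS_BASE, struct.pack("<Q", DATA_BASE))

snap = snapshot(live)
faithful = restore(snap)                    # same addresses: pointer still resolves
relocated = restore(snap, rebase=0x100000)  # shifted addresses: pointer dangles

payload = deref_payload(faithful, ARGS_BASE)
try:
    deref_payload(relocated, ARGS_BASE + 0x100000)
    relocated_ok = True
except KeyError:
    relocated_ok = False
```

In this toy, the faithful restore never needs to know that the argument struct contains a pointer. A relocating restore would have to find and patch every such pointer, which is the kind of pointer chasing the abstract says the address-space closure avoids.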
