GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

Ruifan Chu; Anbang Wang; Xiuxiu Bai; Shuai Liu; Xiaoshe Dong

arXiv:2512.22147 · cs.DC · December 30, 2025

Open Access

TL;DR

This paper introduces an LLM-based framework that optimizes GPU kernels by completing extracted hotspot kernels into minimal executable programs, so that kernels can be tuned across platforms without building the full application. Reported speedups reach 7.77x on benchmark kernels.

Contribution

The paper presents an end-to-end LLM framework that automatically completes extracted GPU kernels into minimal executable programs, then optimizes and validates them without recompiling the full application.

Findings

01 Achieves up to 7.77x speedup on benchmark kernels

02 Reduces search cost through reuse of optimization strategies

03 Enables cross-platform GPU kernel optimization without full builds

Abstract

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large applications where full builds and runs are expensive. We present an end-to-end LLM framework with performance feedback that optimizes kernels without building the full application. From independently extracted hotspot kernels, it automatically completes code into a Minimal Executable Program (MEP), then performs multi-round iterative optimization and evaluation outside the full application. The framework integrates Automatic Error Repair and Performance Pattern Inheritance to fix faults, preserve correctness, reuse effective tiling/memory/synchronization strategies, and reduce search cost. Optimized variants are reintegrated into the original…
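The loop the abstract describes (evaluate candidates outside the full application, repair faulty ones, keep correctness, and carry successful strategies forward) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: `Variant`, `reference`, `is_correct`, `wall_time`, and `optimize` are hypothetical names, plain Python callables stand in for compiled Minimal Executable Programs, and the `propose` and `repair` callbacks stand in for the LLM calls.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Variant:
    """Hypothetical stand-in for one kernel variant wrapped as an MEP."""
    name: str
    run: Callable[[Sequence[float]], List[float]]
    patterns: frozenset = frozenset()  # strategy tags, e.g. {"tiling"}


def reference(xs: Sequence[float]) -> List[float]:
    """Assumed reference result for the toy kernel y = 2x + 1."""
    return [2.0 * x + 1.0 for x in xs]


def is_correct(variant: Variant, inputs: Sequence[float], tol: float = 1e-9) -> bool:
    """A variant is valid only if it runs and matches the reference output."""
    try:
        out = variant.run(inputs)
    except Exception:
        return False
    expected = reference(inputs)
    return len(out) == len(expected) and all(
        abs(a - b) <= tol for a, b in zip(out, expected)
    )


def wall_time(variant: Variant, inputs: Sequence[float], repeats: int = 5) -> float:
    """Best-of-N wall-clock time; a real harness would time the GPU kernel."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        variant.run(inputs)
        best = min(best, time.perf_counter() - start)
    return best


def optimize(baseline, propose, repair, inputs, rounds=3, measure=wall_time):
    """Multi-round search: validate, repair once on failure, keep the fastest."""
    best, best_time = baseline, measure(baseline, inputs)
    inherited = frozenset(baseline.patterns)
    for _ in range(rounds):
        for cand in propose(best, inherited):
            if not is_correct(cand, inputs):
                cand: Optional[Variant] = repair(cand)  # single repair attempt
                if cand is None or not is_correct(cand, inputs):
                    continue  # discard variants that stay faulty
            elapsed = measure(cand, inputs)
            if elapsed < best_time:
                best, best_time = cand, elapsed
                # Strategies of winning variants seed the next proposals.
                inherited = inherited | cand.patterns
    return best, best_time, inherited
```

In this sketch, a caller supplies `propose(parent, inherited)` to return candidate variants and `repair(variant)` to return a fixed variant or `None`. Passing a custom `measure` function makes the loop deterministic for testing; the default measures host-side wall-clock time and is only a placeholder for real kernel timing.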

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy