A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

Qiuyi Qu; Yicheng Sui; Yufei Sun; Rui Chen; Xiaofei Zhang; Yuzhi Zhang; Haofeng Wang; Ge Lan

arXiv:2601.12698 · cs.CL · January 26, 2026

TL;DR

This paper presents a two-stage GPU kernel tuning method combining semantic refactoring into parameterized templates with search-based autotuning, achieving stable, high-performance kernel optimization.

Contribution

It introduces a template-based rewriting layer combined with search-based autotuning to improve stability and interpretability of GPU kernel optimization.

Findings

01 Achieved over 3x speedup on real-world kernels
02 Reduced randomness in iterative optimization process
03 Enhanced stability and interpretability of kernel tuning

Abstract

GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent progress in LLM-agent-based kernel generation and optimization has been reported, yet many approaches primarily focus on direct code rewriting, where parameter choices are often implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality…
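
The two-stage idea in the abstract (refactor a kernel so its tuning knobs become explicit template parameters, then search over those parameters) can be illustrated with a small sketch. This is not the paper's implementation: a pure-Python tiled matrix multiply stands in for a GPU kernel, an exhaustive grid search stands in for the search-based autotuner, and every name here (`matmul_template`, `autotune`, `tile_i`, `tile_j`, `search_space`) is hypothetical.

```python
import itertools
import time


def matmul_reference(a, b):
    """Untiled baseline, used only to check that a candidate is correct."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_template(a, b, tile_i, tile_j):
    """Stage 1 (assumed form): the same computation, refactored so that the
    tile sizes are explicit parameters instead of hard-coded constants."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile_i):
        for j0 in range(0, m, tile_j):
            for i in range(i0, min(i0 + tile_i, n)):
                for j in range(j0, min(j0 + tile_j, m)):
                    acc = 0
                    for p in range(k):
                        acc += a[i][p] * b[p][j]
                    c[i][j] = acc
    return c


def time_once(fn, args, params):
    """Wall-clock time of a single call with the given parameters."""
    start = time.perf_counter()
    fn(*args, **params)
    return time.perf_counter() - start


def autotune(template, args, search_space, reference, repeats=3):
    """Stage 2 (assumed form): try every parameter combination, discard any
    that changes the result, and keep the fastest remaining one."""
    expected = reference(*args)
    names = list(search_space)
    best_params, best_time = None, float("inf")
    for values in itertools.product(*(search_space[name] for name in names)):
        params = dict(zip(names, values))
        if template(*args, **params) != expected:
            continue  # reject configurations that break semantics
        elapsed = min(time_once(template, args, params) for _ in range(repeats))
        if elapsed < best_time:
            best_params, best_time = params, elapsed
    return best_params, best_time


# Small integer matrices keep the equality check exact.
a = [[(i + j) % 7 for j in range(24)] for i in range(24)]
b = [[(i * j) % 5 for j in range(24)] for i in range(24)]
search_space = {"tile_i": [4, 8, 24], "tile_j": [4, 8, 24]}

best_params, best_time = autotune(matmul_template, (a, b), search_space,
                                  matmul_reference)
print(best_params, best_time)
```

Separating the template from the search is the point of the sketch: the set of legal configurations is written down explicitly in `search_space`, so every candidate can be checked for correctness and the chosen parameters can be inspected afterwards. The winning configuration depends on the machine and on timing noise, so no particular output is implied.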

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Distributed and Parallel Computing Systems