WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning

Kaixuan Zhang; Chutong Ding; Shiyou Qian; Luping Wang; Jian Cao; Guangtao Xue; Cheng Huang; Guodong Yang; Liping Zhang

arXiv:2604.10187·cs.PF·April 14, 2026

TL;DR

WaveTune is a wave-aware auto-tuning framework that accurately predicts GPU kernel latency, so near-optimal kernel configurations can be selected at runtime with minimal overhead during LLM inference.

Contribution

WaveTune introduces a unified mapping method, an analytical bilinear model, and a sparse sampling scheme that together improve the efficiency and accuracy of GPU kernel auto-tuning.

Findings

01 Achieves up to 1.83x kernel speedup.

02 Reduces end-to-end time-to-first-token (TTFT) by up to 1.33x.

03 Cuts runtime decision overhead by five orders of magnitude.

Abstract

The rapid adoption of Large Language Models (LLMs) has made GPU inference efficiency an increasingly critical system concern. The runtime of LLM workloads is largely dominated by tile-based kernels, particularly General Matrix Multiplications (GEMMs). Although these kernels are highly optimized, their performance remains sensitive to a large space of runtime parameters, such as tile sizes and pipeline stages. The interaction between these parameters and hardware resources leads to a non-convex optimization landscape. Existing approaches to parameter configuration -- including search-based auto-tuning, heuristic rules, and learned cost models -- face a fundamental trade-off between performance optimality and runtime efficiency. In this paper, we present WaveTune, a wave-aware framework for runtime kernel auto-tuning. First, we introduce a unified mapping method to handle input…
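To make the sensitivity to tile size concrete, the sketch below works through the usual wave-quantization arithmetic for a tiled GEMM: output tiles are scheduled onto the GPU in "waves", and a partially filled last wave leaves hardware idle. This is an illustration of the general idea only, not WaveTune's bilinear model. The function names, the 108-SM device, and the fixed `blocks_per_sm` occupancy are assumptions made for the example; in practice occupancy itself depends on tile size and pipeline stages.

```python
import math


def count_waves(m, n, tile_m, tile_n, num_sms, blocks_per_sm=1):
    """Number of waves needed to compute an m x n output with tile_m x tile_n tiles.

    Assumes one thread block per output tile and a fixed number of resident
    blocks per SM (hypothetical simplification).
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    slots_per_wave = num_sms * blocks_per_sm
    return math.ceil(tiles / slots_per_wave)


def wave_utilization(m, n, tile_m, tile_n, num_sms, blocks_per_sm=1):
    """Fraction of block slots that do useful work across all waves."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    slots_per_wave = num_sms * blocks_per_sm
    waves = math.ceil(tiles / slots_per_wave)
    return tiles / (waves * slots_per_wave)


if __name__ == "__main__":
    # Hypothetical device with 108 SMs and a 4096 x 4096 output matrix.
    for tile in [(64, 64), (128, 128), (256, 128), (256, 256)]:
        waves = count_waves(4096, 4096, tile[0], tile[1], num_sms=108)
        util = wave_utilization(4096, 4096, tile[0], tile[1], num_sms=108)
        print(f"tile={tile}: waves={waves}, utilization={util:.3f}")
```

Under these assumptions, utilization does not change monotonically with tile size, which is one simple source of the irregular tuning landscape the abstract describes.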
