WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning
Kaixuan Zhang, Chutong Ding, Shiyou Qian, Luping Wang, Jian Cao, Guangtao Xue, Cheng Huang, Guodong Yang, Liping Zhang

TL;DR
WaveTune is a wave-aware auto-tuning framework that accurately predicts GPU kernel latency, so that near-optimal kernel configurations can be selected at runtime with minimal overhead for efficient LLM inference.
Contribution
WaveTune introduces a unified mapping method, an analytical bilinear model, and a sparse sampling scheme to improve the efficiency and accuracy of GPU kernel auto-tuning.
Findings
Achieves up to 1.83x kernel speedup.
Reduces end-to-end time-to-first-token (TTFT) by up to 1.33x.
Cuts runtime decision overhead by five orders of magnitude.
Abstract
The rapid adoption of Large Language Models (LLMs) has made GPU inference efficiency an increasingly critical system concern. The runtime of LLM workloads is largely dominated by tile-based kernels, particularly General Matrix Multiplications (GEMMs). Although these kernels are highly optimized, their performance remains sensitive to a large space of runtime parameters, such as tile sizes and pipeline stages. The interaction between these parameters and hardware resources leads to a non-convex optimization landscape. Existing approaches to parameter configuration -- including search-based auto-tuning, heuristic rules, and learned cost models -- face a fundamental trade-off between performance optimality and runtime efficiency. In this paper, we present WaveTune, a wave-aware framework for runtime kernel auto-tuning. First, we introduce a unified mapping method to handle input…
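As general background on why tile parameters produce an uneven latency landscape, the following minimal sketch illustrates wave quantization in a tile-based GEMM. It is not WaveTune's model, which the abstract does not spell out. The function names, the one-thread-block-per-output-tile assumption, and the example value of 108 SMs are illustrative assumptions, not details from the paper.

```python
import math


def tile_and_wave_count(m, n, tile_m, tile_n, num_sms, blocks_per_sm=1):
    """Return (tiles, waves) for an M x N output matrix.

    Assumes one thread block per output tile and that at most
    num_sms * blocks_per_sm blocks run concurrently (hypothetical
    simplification; real occupancy depends on registers, shared
    memory, and pipeline stages).
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    concurrent_slots = num_sms * blocks_per_sm
    waves = math.ceil(tiles / concurrent_slots)
    return tiles, waves


def wave_efficiency(m, n, tile_m, tile_n, num_sms, blocks_per_sm=1):
    """Fraction of block slots doing useful work across all waves."""
    tiles, waves = tile_and_wave_count(m, n, tile_m, tile_n, num_sms, blocks_per_sm)
    return tiles / (waves * num_sms * blocks_per_sm)


if __name__ == "__main__":
    NUM_SMS = 108  # illustrative value only
    # Exactly one full wave: 108 tiles on 108 SMs.
    print(tile_and_wave_count(128 * 108, 128, 128, 128, NUM_SMS))
    # One extra row adds one tile, which forces a second, nearly empty wave.
    print(tile_and_wave_count(128 * 108 + 1, 128, 128, 128, NUM_SMS))
    print(wave_efficiency(128 * 108 + 1, 128, 128, 128, NUM_SMS))
```

Under these assumptions, growing the problem by a single row doubles the wave count while adding almost no useful work. This is one reason latency can change in steps rather than smoothly as shapes and tile sizes vary.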
