Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

Yuxuan Han; Meng-Hao Guo; Zhengning Liu; Wenguang Chen; Shi-Min Hu

arXiv:2603.07169 · cs.LG · March 10, 2026

Open Access

TL;DR

This paper introduces CUDAMaster, a system that automates GPU kernel optimization across multiple scenarios. It reports about 35% speedup over Astra and matches or exceeds cuBLAS in several cases.

Contribution

The paper introduces MSKernelBench, a benchmark for multi-scenario GPU kernels, and proposes CUDAMaster, a hardware-aware, multi-agent system for automated kernel optimization.

Findings

01

CUDAMaster achieves about 35% speedup over Astra.

02

It matches or exceeds the performance of cuBLAS in several cases.

03

MSKernelBench covers fundamental algebraic operations, common LLM kernels, sparse matrix operators, and scientific computing routines.
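
Sparse matrix-vector multiplication is a typical example of the scientific computing workloads that fall outside ML operator optimization. The sketch below is a plain Python reference for y = A·x with A stored in CSR (compressed sparse row) format. It is not taken from the paper or from MSKernelBench; the function name `csr_spmv` and the sample matrix are assumptions chosen for illustration. A reference of this kind is what an optimized GPU kernel's output would usually be checked against.

```python
from typing import List, Sequence


def csr_spmv(
    row_ptr: Sequence[int],
    col_idx: Sequence[int],
    values: Sequence[float],
    x: Sequence[float],
) -> List[float]:
    """Reference y = A @ x for a matrix A stored in CSR format.

    Hypothetical helper for illustration only; not part of any benchmark.
    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero entries).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Walk only the stored nonzeros of this row. The reads of x are
        # indirect (x[col_idx[k]]), which is what makes the access
        # pattern irregular compared with dense kernels.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Sample 3x3 matrix (an assumption for the example):
# [[2, 0, 1],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
print(y)  # [5.0, 6.0, 19.0]
```

The outer loop over rows is independent per row, so a simple GPU version might assign one thread per row. Rows with very different nonzero counts then cause load imbalance, which is one reason such kernels tend to need different optimizations than dense ones.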

Abstract

Optimizing GPU kernels manually is a challenging and time-consuming task. With the rapid development of LLMs, automated GPU kernel optimization is gradually becoming a tangible reality. However, current LLM-driven automated optimization methods narrowly focus on machine learning applications, such as PyTorch operator optimization, while overlooking broader domains like sparse matrix operations in scientific computing. Extending to these broader applications brings new challenges for the benchmark and algorithm. Therefore, developing a general-purpose automated kernel optimization method becomes our primary focus. In this paper, we address the absence of systematic evaluation for multi-scenario settings by introducing MSKernelBench, which spans multiple scenarios, including fundamental algebraic operations, common LLM kernels, sparse matrix operators, and scientific computing routines,…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications