CUDA-LLM: LLMs Can Write Efficient CUDA Kernels
Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, An Zou

TL;DR
This paper introduces CUDA-LLM, a framework that uses large language models combined with a novel optimization method to generate high-performance, architecture-aware CUDA kernels that outperform human-written code in speed.
Contribution
The paper presents FSR, a new framework that jointly optimizes correctness and performance of CUDA code generated by LLMs, enabling automated, efficient GPU kernel creation.
Findings
LLMs with FSR achieve high correctness rates.
Generated kernels outperform human code by up to 179× in speed.
Framework effectively tailors CUDA code to specific GPU architectures.
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating code that is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively parallel GPUs, remains a complex challenge. In this work, we explore the use of LLMs for the automated generation and optimization of CUDA programs, with the goal of producing high-performance GPU kernels that fully exploit the underlying hardware. To address this challenge, we propose a novel framework called Feature Search and Reinforcement (FSR). FSR jointly optimizes compilation and functional correctness, validated through extensive and diverse test cases, and runtime performance, measured by actual kernel execution latency on the target GPU. This approach enables LLMs not only to generate…
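The selection idea described in the abstract (discard candidates that fail to build or produce wrong results, then rank the survivors by measured latency) can be illustrated with a small sketch. This is not the authors' FSR implementation: the function names `passes_tests`, `measure_latency`, and `select_fastest_correct` are hypothetical, plain Python callables stand in for compiled CUDA kernels, a raised exception stands in for a compilation or launch failure, and wall-clock timing stands in for on-GPU latency measurement.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in for a compiled kernel: a function from inputs to outputs.
Kernel = Callable[[Sequence[float]], List[float]]


def passes_tests(kernel: Kernel, reference: Kernel,
                 test_cases: Sequence[Sequence[float]], tol: float = 1e-6) -> bool:
    """Return True only if the candidate matches the reference on every test case."""
    for inputs in test_cases:
        try:
            got = kernel(inputs)
        except Exception:
            # Assumption: any exception models a compile or launch failure.
            return False
        want = reference(inputs)
        if len(got) != len(want):
            return False
        if any(abs(g - w) > tol for g, w in zip(got, want)):
            return False
    return True


def measure_latency(kernel: Kernel, inputs: Sequence[float], repeats: int = 5) -> float:
    """Best-of-N wall-clock time in seconds (a stand-in for GPU event timing)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(inputs)
        best = min(best, time.perf_counter() - start)
    return best


def select_fastest_correct(candidates: Sequence[Tuple[str, Kernel]], reference: Kernel,
                           test_cases: Sequence[Sequence[float]]) -> Optional[Tuple[str, float]]:
    """Filter candidates by correctness, then keep the one with the lowest total latency."""
    best: Optional[Tuple[str, float]] = None
    for name, kernel in candidates:
        if not passes_tests(kernel, reference, test_cases):
            continue  # incorrect or failing candidates are never timed
        latency = sum(measure_latency(kernel, inputs) for inputs in test_cases)
        if best is None or latency < best[1]:
            best = (name, latency)
    return best
```

In a real setting the candidates would be LLM-generated CUDA sources compiled with `nvcc` and timed on the target GPU; the sketch only shows the ordering of the checks, with correctness gating the performance comparison.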
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Big Data and Digital Economy
