CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Weinan Dai; Hanlin Wu; Qiying Yu; Huan-ang Gao; Jiahao Li; Chengquan Jiang; Weiqiang Lou; Yufan Song; Hongli Yu; Jiaze Chen; Wei-Ying Ma; Ya-Qin Zhang; Jingjing Liu; Mingxuan Wang; Xin Liu; Hao Zhou

arXiv:2602.24286 · cs.LG · March 2, 2026

Open Access

TL;DR

CUDA Agent introduces a large-scale reinforcement learning system that significantly improves CUDA kernel generation, outperforming existing models and compiler-based systems in speed and optimization quality.

Contribution

It presents a novel agentic RL framework with scalable data synthesis, automated verification, and stable training techniques for CUDA kernel optimization.

Findings

01. Achieves 100% faster performance than torch.compile on KernelBench levels.

02. Outperforms proprietary models by about 40% on the hardest benchmark level.

03. Demonstrates state-of-the-art results in CUDA kernel optimization.

Abstract

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward…
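The abstract names automated verification and profiling as the source of the reward signal, but the visible text does not say how the reward is computed. As a rough illustration only, the sketch below assumes a hypothetical scheme: a candidate kernel earns nothing unless its output matches a reference within tolerance, and otherwise earns its measured speedup over a baseline. The function names, the tolerances, and the speedup-as-reward choice are all assumptions, not the paper's method.

```python
import math
from typing import Sequence


def outputs_match(candidate: Sequence[float], reference: Sequence[float],
                  rel_tol: float = 1e-4, abs_tol: float = 1e-6) -> bool:
    """Return True when both outputs have equal length and agree element-wise.

    The tolerances are illustrative assumptions, not values from the paper.
    """
    if len(candidate) != len(reference):
        return False
    return all(math.isclose(c, r, rel_tol=rel_tol, abs_tol=abs_tol)
               for c, r in zip(candidate, reference))


def kernel_reward(candidate_out: Sequence[float], reference_out: Sequence[float],
                  candidate_ms: float, baseline_ms: float) -> float:
    """Hypothetical reward: 0.0 for an incorrect kernel, else speedup over baseline."""
    if candidate_ms <= 0 or baseline_ms <= 0:
        raise ValueError("timings must be positive")
    # Verification gate: a fast but wrong kernel must not be rewarded.
    if not outputs_match(candidate_out, reference_out):
        return 0.0
    # Profiling term: a ratio above 1.0 means the candidate beat the baseline.
    return baseline_ms / candidate_ms
```

Gating on correctness before looking at timings reflects why verification matters for a reliable reward: without it, a policy could be rewarded for kernels that are fast only because they skip work.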

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Embedded Systems Design Techniques