CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

Xiaoya Li; Xiaofei Sun; Albert Wang; Jiwei Li; Chris Shum

arXiv:2507.14111·cs.AI·February 4, 2026

TL;DR

CUDA-L1 is a contrastive reinforcement learning framework for automated CUDA kernel optimization. Trained on A100, it reports an average speedup of x3.12 (median x1.42) over the default baselines across all 250 KernelBench kernels, with peak speedups reaching x120.

Contribution

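The paper presents CUDA-L1, an automated reinforcement learning framework for CUDA optimization built on a novel contrastive RL algorithm. It reports speedups over the KernelBench default baseline as well as over Torch Compile and CUDA Graph implementations, and finds that the trained model discovers and combines CUDA optimization techniques.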

Findings

01. Achieves an average speedup of 3.12x on CUDA kernels.

02. Outperforms Torch Compile, CUDA Graph, and cuDNN libraries.

03. Discovers and strategically combines CUDA optimization techniques.

Abstract

The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on A100, it delivers an average speedup of x3.12 with a median speedup of x1.42 against default baselines across all 250 CUDA kernels of KernelBench, with peak speedups reaching x120. In addition to the default baseline provided by KernelBench, CUDA-L1 demonstrates x2.77 over Torch Compile, x2.88 over Torch Compile with reduce overhead, x2.81 over CUDA Graph implementations,…
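The abstract reports both an average (x3.12) and a median (x1.42) speedup, alongside a peak of x120. The gap between the two statistics is what one would expect when a few kernels see very large gains. The sketch below illustrates that effect only; it assumes speedup is defined as baseline runtime divided by optimized runtime, and the timing values and the `speedup` helper are hypothetical, not taken from the paper or from KernelBench.

```python
from statistics import mean, median


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Speedup as baseline runtime over optimized runtime (assumed definition)."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_ms / optimized_ms


# Hypothetical (baseline_ms, optimized_ms) pairs, invented for illustration only.
timings = [(10.0, 10.0), (10.0, 8.0), (10.0, 5.0), (100.0, 1.0)]

speedups = [speedup(b, o) for b, o in timings]
avg_speedup = mean(speedups)      # pulled upward by the single large outlier
med_speedup = median(speedups)    # reflects the typical kernel

print(f"per-kernel speedups: {speedups}")
print(f"average: x{avg_speedup:.2f}, median: x{med_speedup:.2f}")
```

In this toy data a single kernel with a 100-fold gain dominates the average while the median stays below 2, which is why reading the two figures together gives a fuller picture than either one alone.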

Code & Models

Datasets

deepreinforce-ai/CUDA-L1
dataset · 111 dl

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Graph Theory and Algorithms