CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Songqiao Su; Xiaofei Sun; Xiaoya Li; Albert Wang; Jiwei Li; Chris Shum

arXiv:2512.02551 · cs.LG · December 15, 2025

TL;DR

CUDA-L2 uses reinforcement learning with large language models to automatically optimize half-precision matrix multiplication (HGEMM) CUDA kernels. The resulting kernels run faster than existing libraries such as cuBLAS, with the largest reported gains in real-time inference (server mode) scenarios.

Contribution

This work introduces CUDA-L2, an RL-based system that automatically optimizes HGEMM CUDA kernels across 1,000 configurations and outperforms torch.matmul, cuBLAS, and cuBLASLt.

Findings

01 +22.0% average speedup over torch.matmul in offline mode

02 +19.2% speedup over cuBLAS with the optimal layout (NN and TN) in offline mode

03 +28.7% speedup over torch.matmul in server mode

Abstract

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely used torch.matmul to Nvidia's state-of-the-art closed-source libraries, i.e., cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic's suggestion; and +11.4% over the most…
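
To illustrate how an average figure such as "+22.0% over torch.matmul" could be derived, the following minimal Python sketch times a callable and averages per-configuration speedup ratios. It is not the authors' benchmarking code. The names `time_kernel` and `mean_speedup_percent`, the `interval_s` parameter, and the definition of speedup as baseline time divided by candidate time, minus one, are all assumptions made for illustration. Real GPU measurements would also need device synchronization around each call, which this CPU-only sketch omits.

```python
import statistics
import time
from typing import Callable, Sequence


def time_kernel(kernel: Callable[[], None], runs: int = 20,
                interval_s: float = 0.0) -> float:
    """Return the mean wall-clock time of one call to `kernel`, in seconds.

    interval_s == 0.0 runs the calls back to back, as in the abstract's
    description of offline mode. A positive value idles between calls;
    treating that as "server mode" is an assumption of this sketch.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        kernel()  # hypothetical stand-in for launching one matmul kernel
        samples.append(time.perf_counter() - start)
        if interval_s > 0.0:
            time.sleep(interval_s)  # idle time is not counted in the sample
    return statistics.fmean(samples)


def mean_speedup_percent(baseline_s: Sequence[float],
                         candidate_s: Sequence[float]) -> float:
    """Average speedup of the candidate over the baseline, in percent.

    Assumed definition: per-configuration ratio baseline/candidate,
    averaged over configurations, minus one.
    """
    if not baseline_s or len(baseline_s) != len(candidate_s):
        raise ValueError("need one timing per configuration for both sides")
    ratios = [b / c for b, c in zip(baseline_s, candidate_s)]
    return (statistics.fmean(ratios) - 1.0) * 100.0


if __name__ == "__main__":
    # Toy timings (seconds) for two made-up configurations, not paper data.
    baseline = [1.10, 1.30]
    candidate = [1.00, 1.00]
    print(f"{mean_speedup_percent(baseline, candidate):+.1f}%")
```

Averaging ratios rather than summing raw times keeps large matrix shapes from dominating the result; whether the paper aggregates this way is not stated in the text above.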

Code & Models

Datasets

deepreinforce-ai/CUDA-L2
dataset · 105 downloads

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Multimodal Machine Learning Applications