CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe
Tara Saba, Anne Ouyang, Xujie Si, Fan Long

TL;DR
CuTeGen is an agentic framework that uses LLMs to iteratively generate, test, and refine GPU kernels, achieving high performance and correctness through structured optimization and validation.
Contribution
It introduces a structured generate-test-refine workflow for GPU kernel development that combines LLMs with the CuTe abstraction layer to optimize kernels progressively.
Findings
Produces functionally correct kernels for matrix multiplication and activation workloads.
Achieves performance competitive with optimized libraries.
Utilizes execution-based validation and staged optimization for kernel refinement.
Abstract
High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness and achieve competitive performance across iterative refinements. We present CuTeGen, an agentic framework for automated generation and optimization of GPU kernels that treats kernel development as a structured generate-test-refine workflow. Unlike approaches that rely on one-shot generation or large-scale search over candidate implementations, CuTeGen focuses on progressive refinement of a single evolving kernel through execution-based…
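The generate-test-refine idea described in the abstract can be illustrated with a minimal Python sketch. This is not CuTeGen's implementation: every name here (`Feedback`, `validate`, `refine_loop`, `stub_propose`) is hypothetical, plain Python functions stand in for GPU kernels, and a hand-written stub stands in for the LLM. It only shows the control flow of refining a single evolving candidate using execution-based feedback.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Assumption: a "kernel" is modeled as a plain Python function over a list of floats.
Kernel = Callable[[Sequence[float]], List[float]]


@dataclass
class Feedback:
    correct: bool    # did the candidate match the reference output?
    seconds: float   # wall-clock time of the candidate run
    message: str     # diagnostic handed back to the proposer


def validate(kernel: Kernel, reference: Kernel,
             inputs: Sequence[float], tol: float = 1e-6) -> Feedback:
    """Run the candidate and compare it against a trusted reference."""
    try:
        start = time.perf_counter()
        got = kernel(inputs)
        elapsed = time.perf_counter() - start
    except Exception as exc:  # runtime failures become feedback, not crashes
        return Feedback(False, float("inf"), f"runtime error: {exc}")
    want = reference(inputs)
    if len(got) != len(want) or any(abs(g - w) > tol for g, w in zip(got, want)):
        return Feedback(False, elapsed, "output mismatch")
    return Feedback(True, elapsed, "ok")


def refine_loop(propose: Callable[[Optional[Kernel], Optional[Feedback]], Kernel],
                reference: Kernel, inputs: Sequence[float],
                max_iters: int = 5) -> Tuple[Optional[Kernel], List[Feedback]]:
    """Refine one evolving candidate; keep the fastest correct version seen."""
    current: Optional[Kernel] = None
    feedback: Optional[Feedback] = None
    best: Optional[Kernel] = None
    best_time = float("inf")
    history: List[Feedback] = []
    for _ in range(max_iters):
        current = propose(current, feedback)               # generate / refine
        feedback = validate(current, reference, inputs)    # test
        history.append(feedback)
        if feedback.correct and feedback.seconds < best_time:
            best, best_time = current, feedback.seconds
    return best, history


# Stand-ins for illustration only.
def relu_reference(xs: Sequence[float]) -> List[float]:
    return [max(0.0, x) for x in xs]


def buggy_relu(xs: Sequence[float]) -> List[float]:
    return list(xs)  # bug: negative values are not clamped


def fixed_relu(xs: Sequence[float]) -> List[float]:
    return [x if x > 0 else 0.0 for x in xs]


def stub_propose(current: Optional[Kernel], feedback: Optional[Feedback]) -> Kernel:
    """Hypothetical proposer: first attempt is buggy, then repaired on failure."""
    if current is None:
        return buggy_relu
    if feedback is not None and not feedback.correct:
        return fixed_relu
    return current
```

Calling `refine_loop(stub_propose, relu_reference, [-1.0, 0.5, 2.0], max_iters=3)` returns the best correct candidate together with the per-iteration feedback. In a real system the proposer would be an LLM call and validation would compile and launch the kernel on a GPU; both are outside the scope of this sketch.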
