Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance

Siva Kumar Sastry Hari; Vignesh Balaji; Sana Damani; Qijing Huang; Christos Kozyrakis

arXiv:2603.29010 · cs.LG · April 1, 2026

TL;DR

This paper introduces μCUTLASS, a compact domain-specific language (DSL) for GPU kernels, together with Speed-of-Light (SOL) guidance. Combined, they let LLM-based kernel optimization agents reach larger speedups than baseline agents while using fewer trials and fewer tokens.

Contribution

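The paper proposes two design principles for LLM kernel optimization agents: a compact DSL that can be learned in context, and SOL guidance based on first-principles performance bounds. Together they make the search faster and cheaper while improving the performance of the resulting kernels.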

Findings

01

Switching to DSL code with GPT-5-mini yields a 1.27x speedup over PyTorch.

02

Adding SOL guidance raises the speedup to 1.56x.

03

SOL-guided budgeting reduces token use by 19-43% while retaining 95% of the speedup.

Abstract

Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials will save both runtime and cost. We make two key observations. First, the abstraction level that agents operate at is important. If it is too low, the LLM wastes reasoning on low-impact details. If it is too high, it may miss important optimization choices. Second, agents cannot easily tell when they reach the point of diminishing returns, wasting resources as they continue searching. These observations motivate two design principles to improve efficiency: (1) a compact domain-specific language (DSL) that can be learned in context and lets the model reason at a higher level while preserving important optimization levers, and (2) Speed-of-Light (SOL) guidance that uses first-principles performance bounds to steer…
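
The excerpt above does not show how the paper computes or applies its SOL bounds. As a rough illustration of what a first-principles performance bound can look like, the Python sketch below uses a roofline-style estimate: a kernel cannot run faster than the larger of its compute time and its memory-traffic time at peak hardware rates. This is an assumed stand-in, not the authors' method. The names `Device`, `sol_time_seconds`, `sol_fraction`, and `should_stop`, the device peak numbers, and the 0.9 stopping threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    # Hypothetical peak rates; substitute the numbers for real hardware.
    peak_flops: float      # peak arithmetic throughput in FLOP/s
    peak_bandwidth: float  # peak DRAM bandwidth in bytes/s


def sol_time_seconds(flops: float, bytes_moved: float, device: Device) -> float:
    """Roofline-style lower bound on runtime (an assumed form of SOL bound)."""
    compute_time = flops / device.peak_flops
    memory_time = bytes_moved / device.peak_bandwidth
    # The kernel is limited by whichever resource takes longer.
    return max(compute_time, memory_time)


def sol_fraction(measured_seconds: float, sol_seconds: float) -> float:
    """Fraction of the bound achieved; 1.0 means the kernel runs at the bound."""
    return sol_seconds / measured_seconds


def should_stop(measured_seconds: float, sol_seconds: float,
                threshold: float = 0.9) -> bool:
    """Hypothetical stopping rule: stop searching once close to the bound."""
    return sol_fraction(measured_seconds, sol_seconds) >= threshold


# Example workload: FP32 GEMM with M = N = K = 4096.
m = n = k = 4096
flops = 2 * m * n * k                      # one multiply and one add per MAC
bytes_moved = 4 * (m * k + k * n + m * n)  # read A and B, write C, 4 bytes each
device = Device(peak_flops=50e12, peak_bandwidth=2e12)

bound = sol_time_seconds(flops, bytes_moved, device)
print(f"SOL bound: {bound * 1e3:.3f} ms")
print("stop at 2x the bound?", should_stop(2 * bound, bound))
print("stop at 95% of SOL?", should_stop(bound / 0.95, bound))
```

In a sketch like this, an agent would compare each profiled candidate against `bound` and spend further trials only while the gap to the bound is still large, which is one way to detect the point of diminishing returns the abstract describes.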
