CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding

TL;DR
CudaForge introduces a training-free multi-agent framework utilizing LLMs and hardware feedback to automatically generate and optimize CUDA kernels, achieving high correctness, speedup, and cost efficiency across diverse hardware and models.
Contribution
It presents a novel multi-agent, training-free workflow with hardware feedback for CUDA kernel optimization, surpassing existing methods in efficiency, correctness, and generalization.
Findings
Achieves 97.6% correctness in generated kernels.
Provides an average 1.68× speedup over PyTorch baselines.
Cost-effective with about $0.3 API cost per kernel.
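As a rough illustration of how headline numbers like these are typically derived, the sketch below computes a correctness rate and an average speedup from per-task timings. The `summarize` function, its input layout, and the use of an arithmetic mean over correct kernels are assumptions made for illustration; the summary above does not state how the paper aggregates its results.

```python
from statistics import mean

def summarize(results):
    """Aggregate per-task results (hypothetical layout, not from the paper).

    results: list of (is_correct, baseline_ms, kernel_ms) tuples, one per task.
    Returns (correctness_rate, average_speedup_over_correct_kernels).
    """
    # Keep only tasks whose generated kernel passed the correctness check.
    correct = [r for r in results if r[0]]
    correctness = len(correct) / len(results)
    if not correct:
        return correctness, 0.0
    # Speedup is baseline time divided by kernel time (above 1.0 means faster).
    speedups = [baseline_ms / kernel_ms for _, baseline_ms, kernel_ms in correct]
    # Assumption: arithmetic mean; a geometric mean would give a different figure.
    return correctness, mean(speedups)
```

Under this convention a kernel that is slower than the baseline contributes a speedup below 1.0, and incorrect kernels count against correctness but are left out of the speedup average.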
Abstract
Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code generation. Existing methods for automatic kernel generation, however, often produce low-efficiency kernels, incur high computational overhead, and fail to generalize across settings. In this work, we propose CudaForge, a training-free multi-agent workflow for CUDA kernel generation and optimization. Our workflow is inspired by the iterative workflow of human experts, which contains steps such as developing initial kernels, testing correctness, analyzing hardware feedback, and iterative improvement. More specifically, CudaForge employs two LLM agents: a Coder and a Judge, that iteratively generate, correct, and optimize CUDA kernels, while integrating…
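The generate, test, profile, and refine cycle described in the abstract can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `refine_kernel`, the four callables, the `runtime_ms` metric key, and the fixed round budget are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

def refine_kernel(
    task: str,
    coder: Callable[[str, Optional[str]], str],            # hypothetical: (task, feedback) -> kernel source
    is_correct: Callable[[str], bool],                      # hypothetical: correctness test against a reference
    profile: Callable[[str], Dict[str, float]],             # hypothetical: hardware metrics, incl. "runtime_ms"
    judge: Callable[[str, bool, Dict[str, float]], str],    # hypothetical: (kernel, ok, metrics) -> feedback
    max_rounds: int = 10,
) -> Tuple[Optional[str], float]:
    """Iterate Coder and Judge, keeping the fastest correct kernel seen."""
    best_kernel, best_ms = None, float("inf")
    feedback = None  # the first round has no Judge feedback yet
    for _ in range(max_rounds):
        kernel = coder(task, feedback)
        ok = is_correct(kernel)
        # Only profile kernels that pass the correctness check.
        metrics = profile(kernel) if ok else {}
        if ok and metrics["runtime_ms"] < best_ms:
            best_kernel, best_ms = kernel, metrics["runtime_ms"]
        # The Judge turns the test result and hardware metrics into guidance
        # (a correction hint if incorrect, an optimization hint otherwise).
        feedback = judge(kernel, ok, metrics)
    return best_kernel, best_ms
```

In a real setup the callables would wrap LLM API calls, a compile-and-compare harness, and a profiler run; here they are left abstract so the control flow stays visible.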
Peer Reviews
Decision·Submitted to ICLR 2026
The inclusion of hardware execution feedback seems like a critical step in improving autogenerated kernels, and the investigation into which metrics actually matter as feedback is generally useful beyond this paper (not only to limit the context to be put into the LLM, but also, gathering reduced statistics makes profiling faster). The method achieves very good correctness scores, while being computationally cheaper than competitors; in particular, it is training-free. Having an example of how
Unfortunately, KernelBench, as a benchmark, is quite flawed, because many of its tasks use shapes that are too small, exacerbating the overheads induced by not using torch.compile as the baseline. It seems hard to believe that on something as essential as cross-entropy, there'd be a 4x speedup left on the table; Figure 4 suggests that the framework is doing something promising, but I am very skeptical about the reported speedups reflecting meaningful scenarios. I'm not sure the "Comparison wi
1. Originality: Incorporating Nsight Compute profiling data is an effective approach. The outcome is highly verifiable. This bridges a gap between abstract code generation and hardware-aware tuning, mimicking expert workflows in a systematic way. The separation of roles into a Coder and a Judge is sound and understandable. CudaForge performs optimization purely at inference time, showing that meaningful performance gains are achievable without learning-based fine-tuning.
2. Quality: The evaluation
The overall multi-agent refinement structure follows a familiar template used in prior agent-based code generation frameworks using self-refine. The core advance lies in the feedback modality rather than a fundamentally new learning or reasoning principle. The paper draws a hard line between training-free and RL-based paradigms but doesn’t explore hybrid approaches. In practice, the line can be blurred. If the kernel performance is verifiable, it's possible to train the model for better answers. While
- Proposes a training-free, iterative-refinement-based approach for CUDA kernel generation.
- Demonstrates the use of LLMs with different identities (Coder and Judge) in producing performant CUDA kernels.
- Provides a systematic methodology to extract and refine the output of the Nsight profiler.
- Provides a detailed analysis of related methods and brings forth interesting observations.
- The method has been shown to work across various frontier and open-source models.
- The efficacy of this approach is not demonstrated on popular but low-resource languages such as Triton.
- Unlike other approaches, the paper specifies no clear evaluation methodology. A precise evaluation setup is extremely important in such tasks.
- Measuring performance against a native PyTorch implementation without torch.compile does not reflect a comparison with a true baseline.
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques
