Kevin: Multi-Turn RL for Generating CUDA Kernels
Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, Silas Alberti

TL;DR
This paper introduces Kevin, a multi-turn reinforcement learning model designed to generate and optimize CUDA GPU kernels, significantly improving correctness and speedup over baseline models by effectively leveraging iterative feedback.
Contribution
Kevin is the first model trained with multi-turn RL for CUDA kernel generation, explicitly modeling the iterative refinement process in GPU programming.
Findings
Correctness of generated kernels improved from 56% to 82% over the base model (QwQ-32B).
Mean speedup increased from 0.53× to 1.10×.
Scaling serial refinement yields better results than parallel sampling.
Abstract
Writing GPU kernels is a challenging task and critical for AI systems' efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin - K(ernel D)evin, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving correctness of generated kernels (in pure CUDA) from 56% to 82% and mean…
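The abstract names correctness and speedup as the verifiable reward signals. As a minimal sketch of how a per-turn score could combine the two: the function name `kernel_reward`, the `correctness_bonus` weight of 0.3, and the millisecond timing inputs are all illustrative assumptions, not values taken from the text above.

```python
def kernel_reward(correct: bool, baseline_ms: float, kernel_ms: float,
                  correctness_bonus: float = 0.3) -> float:
    """Hypothetical per-turn score built from correctness and speedup.

    Assumptions (not from the source): an incorrect kernel scores 0, and a
    correct kernel scores a fixed bonus plus its speedup over a reference
    implementation.
    """
    if not correct:
        # Assumption: runtime is ignored when the output is wrong.
        return 0.0
    if baseline_ms <= 0 or kernel_ms <= 0:
        raise ValueError("timings must be positive")
    speedup = baseline_ms / kernel_ms  # > 1 means faster than the reference
    return correctness_bonus + speedup


if __name__ == "__main__":
    print(kernel_reward(True, baseline_ms=10.0, kernel_ms=5.0))
    print(kernel_reward(False, baseline_ms=10.0, kernel_ms=5.0))
```

Gating the speedup term on correctness is one simple way to keep a fast but wrong kernel from being rewarded.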
Peer Reviews
Decision: ICLR 2026 Poster
- This is my first time encountering multi-turn in the context of code generation. The results support its usage from both a correctness and performance perspective.
- The authors investigated multiple aspects of the RL training procedure, from the generation of the training data, reward assignment, and the composition and structure of samples during training. All of these training hyperparameters were shown to have a notable impact on the quality of the result and should provide a useful data…
- I found the discussion of the choice of baseline model in Appendix B.6 to be insufficient from the reader's perspective. While it seems completely plausible for the largest model to have the best priors and for smaller models to be more susceptible to reward hacking, it may be the case that certain updates to the reward function or training could alleviate these issues.
- One of the major limitations, noted by the authors, is the limited number of robust tasks usable for training. With access to more ta…
- Demonstration of multi-turn RL for GPU kernel generation.
- RL reward formulation as a function of correctness and speedup.
- Effectiveness of multi-turn RL to improve inference time scaling trends.
- Clearly comparing gains against single-turn RL.
- The evaluation methodology does not follow standard practices. Kevin trains on 180/200 examples from the KernelBench evaluation benchmark. The authors must create their own dataset for training and then evaluate on KernelBench. The current setup does not allow a fair comparison with existing approaches and goes against standard practices.
- The paper does not specify where the initial CoTs are obtained from.
- In section 4 line 237, the description is unclear.
- In subsection 4.1: Summarizing all previous CoTs d…
**Realistic problem setup.** The work models how GPU performance engineers actually iterate: propose kernel → profile → refine. The RL formulation explicitly credits early partial attempts that later lead to a fast kernel, instead of rewarding only final outputs.
**Clear engineering advances.** They introduce (i) per-turn training on every refinement step to improve sample efficiency and (ii) discounted future-return style reward aggregation for credit assignment across turns.
**Beating stro…
**Performance gains are marginal vs the main ablation.** Compared to the single-turn RL baseline, Kevin does not improve solve rate: correctness best@16 stays 82% vs 82%, and fast1 best@16 stays 43% vs 43%. The main improvement is higher best-case speedup (1.10× vs 0.85×), and average speedup only rises slightly (0.40× vs 0.35×). So the method is not solving more tasks; it’s mostly squeezing somewhat better runtime on tasks that were already solvable.
**Heavy system heuristics, limited theory.…
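The reviews above mention a discounted future-return style reward aggregation for credit assignment across turns. A minimal sketch of that idea, under stated assumptions: each turn's return is its own score plus a discounted sum of the scores of later turns in the same trajectory. The function name `discounted_returns` and the default `gamma` of 0.4 are hypothetical choices for illustration and are not confirmed by the text here.

```python
from typing import List, Sequence


def discounted_returns(turn_scores: Sequence[float], gamma: float = 0.4) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for each refinement turn t.

    Assumptions (not from the source): rewards are aggregated by a
    discounted sum, and gamma is a free hyperparameter in [0, 1].
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    returns = [0.0] * len(turn_scores)
    running = 0.0
    # Walk backwards so each turn can reuse the return of the turn after it.
    for t in reversed(range(len(turn_scores))):
        running = turn_scores[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Two failed attempts followed by a kernel that scores 2.0.
    print(discounted_returns([0.0, 0.0, 2.0], gamma=0.5))
```

With this form, an early turn that scores nothing on its own still receives credit when a later refinement in the same trajectory succeeds, which matches the behavior the review describes.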
Taxonomy
Topics: Speech Recognition and Synthesis · Fuzzy Logic and Control Systems · Neural Networks and Applications
