AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs
Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, Maosong Sun

TL;DR
AutoTriton leverages reinforcement learning to automate Triton GPU kernel programming, reducing manual tuning and achieving performance comparable to large models, thus paving the way for more efficient AI systems.
Contribution
First RL-based model for automatic Triton kernel programming, combining supervised fine-tuning and reinforcement learning to improve performance and ease of GPU kernel development.
Findings
AutoTriton, built on an 8B model, achieves performance comparable to much larger models.
Both the SFT and RL stages are crucial to AutoTriton's performance.
Reward design significantly impacts kernel optimization.
Abstract
Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific languages like Triton simplify GPU programming by abstracting low-level details, developers must still manually tune critical parameters such as tile sizes and memory access patterns through iterative experimentation, creating substantial barriers to optimal performance and wider adoption. In this work, we introduce AutoTriton, the first model dedicated to Triton programming powered by reinforcement learning (RL). AutoTriton first undergoes supervised fine-tuning (SFT) on data from a high-quality data gathering pipeline to acquire essential Triton programming expertise, and then conducts RL with the Group Relative Policy Optimization (GRPO) algorithm, combining a…
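The group-relative step that gives GRPO its name can be sketched in a few lines. This is a minimal illustration of the commonly described formulation (each sampled completion's reward is normalized against the mean and standard deviation of its own sampling group), not AutoTriton's training code. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per completion sampled for the
    same prompt. `eps` (an assumed value) avoids division by zero when
    every completion in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled kernels for one prompt: two pass the checks, two fail.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.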
Peer Reviews
Decision: Submitted to ICLR 2026
+ It addresses a critical bottleneck in modern AI infrastructure: the manual effort and expertise required to write efficient GPU kernels. The integration of LLMs with feedback-driven autotuning aligns well with current trends in AI-assisted compiler optimization and code synthesis.
+ The framework combines static analysis, runtime profiling, and iterative LLM prompting, forming a tight optimization loop. Its error recovery and retry mechanism allows it to gracefully handle compilation or runti
- The system still relies heavily on the LLM’s prompt design and prior exposure to Triton code examples. Although feedback improves results, semantic misunderstandings (e.g., incorrect indexing or boundary handling) remain common in early iterations.
- While practical, the paper lacks a deeper theoretical model for convergence or performance bounds of its iterative improvement process. There’s no formal guarantee that feedback-driven refinement leads to monotonic performance improvement.
- Eva
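The boundary-handling mistakes mentioned in this review can be made concrete with a pure-Python emulation of a tiled elementwise kernel. This is a sketch, not Triton code: the outer loop stands in for the launch grid, the inner loop for the lanes of one tile, and the `idx < n` check for the mask that a Triton kernel would pass to `tl.load` and `tl.store`. The function names and the tile size used below are assumptions chosen for illustration.

```python
def cdiv(n, d):
    """Ceiling division: number of tiles of size d needed to cover n items."""
    return (n + d - 1) // d


def tiled_add(x, y, block_size):
    """Emulate a tiled vector add with an explicit boundary mask."""
    n = len(x)
    out = [0.0] * n
    for pid in range(cdiv(n, block_size)):  # one "program" per tile
        start = pid * block_size
        for lane in range(block_size):      # lanes within the tile
            idx = start + lane
            if idx < n:                     # mask: skip out-of-range lanes
                out[idx] = x[idx] + y[idx]
    return out


# 5 elements with a tile of 4: the second tile has 3 masked-off lanes.
result = tiled_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], block_size=4)
```

Dropping the mask, or using floor division for the grid size, gives the two classic failures: an out-of-bounds access in the last tile, or a tail of elements that is never computed.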
- The process of automatically generating a training dataset to support both the SFT and RL phases is a great direction to pursue. The SFT and RL results on the KernelBench suite demonstrate the effectiveness of the approach to impact kernel generation.
- The decomposition of the gains into those achieved through SFT and RL is interesting and acknowledges the strengths of both approaches used independently and in tandem.
- The results illustrate that although AutoTriton is based on an 8B that it
- Although Triton is a useful DSL, it wasn't clear to me why the techniques wouldn't be equally applicable to several programming languages that are of interest to the ML community.
- While the comparisons with other LLMs do support the authors' claims, they compare a fine-tuned model with a model that is not tuned for Triton coding. From that perspective, I would expect AutoTriton to do better than these baseline models.
- Outside of additions to dissuade the model from reward hacking, th
S1: Well-designed pipeline: The paper presents a systematic end-to-end data pipeline for high-quality Triton kernel collection and verification.
S2: Quantitative validation of RL gains: Clear ablation results show consistent improvement of RL over SFT-only baselines (Tables 1 and 2).
S3: Detailed experimental setup: Evaluation protocols and hyperparameters are thoroughly described, ensuring reproducibility.
W1: Limited novelty: Prior works (e.g., AI CUDA Engineer) already apply RL or agentic loops to CUDA kernel optimization. The main contribution appears to be applying similar ideas to Triton rather than introducing new RL methodology.
W2: Incomplete reward design: The RL reward focuses on functional correctness but does not explicitly optimize runtime speed, as the authors themselves note.
W3: Unclear difference from CUDA approaches: The paper lacks deeper analysis of what makes Triton-specific o
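To make W2 concrete, the following sketch shows one hypothetical way a correctness-gated reward could be extended with a bounded speed bonus. It is not the reward used in the paper. The function name, the tolerance, the weight, and the speedup cap are all assumptions for illustration, and the timings are assumed to be positive measurements in milliseconds.

```python
def kernel_reward(candidate_out, reference_out, candidate_ms, baseline_ms,
                  tol=1e-4, speed_weight=0.5, max_speedup=2.0):
    """Hypothetical reward: correctness gate plus a capped speed bonus.

    Returns 0.0 for an incorrect kernel, 1.0 for a correct kernel that is
    no faster than the baseline, and up to 1.0 + speed_weight for a correct
    kernel that reaches `max_speedup` or more over the baseline.
    """
    correct = len(candidate_out) == len(reference_out) and all(
        abs(c - r) <= tol for c, r in zip(candidate_out, reference_out)
    )
    if not correct:
        return 0.0  # speed never compensates for a wrong result

    speedup = baseline_ms / candidate_ms  # assumes positive timings
    # Clamp the bonus to [0, 1]: slowdowns earn nothing, gains are capped.
    bonus = min(max(speedup - 1.0, 0.0), max_speedup - 1.0) / (max_speedup - 1.0)
    return 1.0 + speed_weight * bonus


# A correct kernel that runs twice as fast as the baseline.
fast_and_correct = kernel_reward([1.0, 2.0], [1.0, 2.0],
                                 candidate_ms=1.0, baseline_ms=2.0)
```

Gating on correctness first keeps a fast but wrong kernel from outscoring a slow but correct one, and capping the bonus limits the incentive to exploit noisy timing measurements.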
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Network Packet Processing and Optimization
Methods: Shrink and Fine-Tune
