TritonRL: Training LLMs to Think and Code Triton Without Cheating
Jiin Woo, Shaowei Zhu, Allen Nie, Zhen Jia, Yida Wang, Youngsuk Park

TL;DR
TritonRL is a specialized 8B-scale LLM trained with a novel reinforcement learning framework to generate correct and efficient Triton kernels, overcoming data scarcity and reward hacking issues.
Contribution
The paper introduces TritonRL, a domain-specific LLM trained via a new RL approach with hierarchical reward decomposition and verification, improving kernel correctness and performance.
Findings
Achieves state-of-the-art correctness on KernelBench
Outperforms Triton-specific models in speed and accuracy
Matches performance of larger models with over 100B parameters
Abstract
The rapid evolution of Large Language Models (LLMs) has driven a growing demand for automated, high-performance system kernels to accelerate machine learning workloads. We introduce TritonRL, a domain-specialized 8B-scale LLM for Triton programming, trained via a novel reinforcement learning (RL) framework. While Triton synthesis faces unique challenges, including data scarcity and a high susceptibility to reward hacking, our approach enables robust kernel generation through two primary innovations. First, we implement a multi-layered verification system that provides high-fidelity reward signals, ensuring that generated kernels are both syntactically and functionally valid. Second, we propose Hierarchical Reward Decomposition (HRD), which decouples reinforcement for high-level reasoning and low-level implementation to resolve the credit assignment problem in long-sequence generation.…
Peer Reviews
Decision: Submitted to ICLR 2026
- This paper proposes a technically sound and well-motivated approach combining supervised fine-tuning and RL with curated training data.
- In experiments, the proposed approach outperforms other open Triton kernel generation models and is comparable to proprietary LLMs like Claude, which demonstrates its effectiveness.
- The writing is easy to understand.
- Although the writing is easy to understand, the paper lacks an important part: no related work is presented, which makes it difficult to assess the paper with respect to the literature.
- As this reviewer is new to the Triton kernel generation task, it was not clear why the task is important or how it differs from other tasks such as CUDA code generation. The significance of the approach was therefore not well demonstrated.
- Another major concern is novelty. Even though the proposed a
1. The paper is overall well-written and clearly organized. It effectively frames the core challenge of reward hacking in kernel generation, where models superficially pass tests by calling high-level libraries rather than writing valid kernels.
2. The proposed solutions are mostly clear. The robust verifier combines rules and an LLM-judge to ensure functional validity. The hierarchical reward decomposition is designed for credit assignment, decoupling rewarding speedup from correctness.
3. The
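To make the rule-plus-judge idea in this review concrete, here is a minimal sketch of what such a verifier could look like. It is not the paper's verifier: the function names, the specific rule (at least one `@triton.jit` function must be defined and launched with `kernel[grid](...)`), and the `llm_judge` callable are all assumptions chosen for illustration.

```python
import ast
from typing import Callable


def jit_kernel_names(tree: ast.AST) -> set:
    """Names of functions decorated with `triton.jit` or `jit` (hypothetical rule)."""
    names = set()
    for node in ast.walk(tree):
        if not isinstance(node, ast.FunctionDef):
            continue
        for dec in node.decorator_list:
            # Handle both `@triton.jit` and `@triton.jit(...)`.
            target = dec.func if isinstance(dec, ast.Call) else dec
            is_attr_jit = isinstance(target, ast.Attribute) and target.attr == "jit"
            is_name_jit = isinstance(target, ast.Name) and target.id == "jit"
            if is_attr_jit or is_name_jit:
                names.add(node.name)
    return names


def launched_kernel_names(tree: ast.AST) -> set:
    """Names used in launch syntax `kernel[grid](...)`."""
    launched = set()
    for node in ast.walk(tree):
        # `kernel[grid](...)` parses as Call(func=Subscript(value=Name)).
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Subscript)
            and isinstance(node.func.value, ast.Name)
        ):
            launched.add(node.func.value.id)
    return launched


def passes_rule_checks(source: str) -> bool:
    """Reject code that does not parse or never launches a jitted kernel."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return bool(jit_kernel_names(tree) & launched_kernel_names(tree))


def is_valid_kernel(source: str, llm_judge: Callable[[str], bool]) -> bool:
    """Cheap static rules first; the (assumed) LLM judge only sees survivors."""
    return passes_rule_checks(source) and llm_judge(source)
```

Running the static rules before the judge keeps the expensive model call off obviously invalid samples, such as code that only wraps `torch` operators. A real verifier would need more rules than this single check.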
1. While the robust verifier is a practical and necessary engineering solution for this specific problem, the core method (combining rule-based heuristics with an LLM-judge to prevent reward hacking) is not a particularly novel concept. Using LLMs as process reward models to prevent reward hacking is an increasingly standard approach.
2. The Hierarchical Reward Decomposition (HRD) design may introduce a new, unaddressed risk. By rewarding "code" tokens only for `correctness` and "plan" tokens fo
The paper addresses the task of training LLMs to generate Triton kernels, which is an interesting task in itself. The main contributions of the paper, as I see them, are: 1. a "robust verifier" with cheating detection, which gives a better reward signal. 2. a reward design that primarily favors correctness of Triton kernels and rewards speedup on top with weight $\alpha=0.1$, which leads to better performance. The second, "hierarchical reward decomposition", is intuitive and well-motivated. The idea of separating rewards f
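The reward shape this reviewer describes (correctness first, speedup added with weight $\alpha=0.1$) can be illustrated with a small sketch. Only the weight 0.1 comes from the review text. The additive form, the gating on validity, and the function and argument names are assumptions, and the paper's actual formula may differ.

```python
ALPHA = 0.1  # speedup weight quoted in the review


def kernel_reward(is_valid: bool, is_correct: bool, speedup: float,
                  alpha: float = ALPHA) -> float:
    """Assumed shape: no reward unless the kernel is valid and correct,
    then a base reward of 1.0 plus a small speedup bonus."""
    if not (is_valid and is_correct):
        return 0.0
    return 1.0 + alpha * speedup
```

With a small `alpha`, a correct kernel that is no faster than the baseline still earns most of the reward, while an incorrect but fast kernel earns none. That matches the priority on correctness the reviewer describes.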
1. Method
- The section on "Data Mixing Optimization" looks problematic. The paper introduces a complex optimization framework, defining a reward function on the test set and a reward interaction matrix $\mathcal{S}$ (Eq. 4), which implies a sophisticated method for finding an optimal mixing probability $p$. However, the paper then states: "...we simply evaluate three candidate initializations, $p \in \{[1,0], [0,1], [0.5,0.5]\}$, and choose the best-performing mixture...". This is a simple heuristic tes
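The procedure quoted in this review amounts to a three-way argmax over fixed candidates, which a short sketch makes plain. The three candidate vectors are taken from the quote. The `evaluate` callable, which stands in for training and scoring a model under a given mixture, is hypothetical.

```python
from typing import Callable, Sequence, Tuple

Mixture = Tuple[float, float]

# The three candidate mixing probabilities quoted from the paper.
CANDIDATES: Sequence[Mixture] = ((1.0, 0.0), (0.0, 1.0), (0.5, 0.5))


def pick_best_mixture(evaluate: Callable[[Mixture], float],
                      candidates: Sequence[Mixture] = CANDIDATES) -> Mixture:
    """Return the candidate with the highest score under `evaluate`."""
    return max(candidates, key=evaluate)
```

For example, `pick_best_mixture(lambda p: -abs(p[0] - 0.5))` selects `(0.5, 0.5)` under a toy score that prefers balanced mixtures.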
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Machine Learning and Data Classification
