TL;DR
KernelBenchX is a benchmark that evaluates the correctness and hardware efficiency of LLM-generated Triton GPU kernels across 176 tasks in 15 categories, and shows that task category, more than generation method, determines whether a kernel is correct.
Contribution
The paper introduces KernelBenchX, a benchmark for category-aware evaluation of LLM-based GPU kernel generation, and uses it to measure how task structure and iterative refinement affect correctness and speedup.
Findings
Task structure influences correctness more than method design.
Iterative refinement improves correctness but reduces speedup.
Correctness does not guarantee efficiency; quantization remains unsolved.
Abstract
LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBenchX, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design. Category explains nearly three times more variance in semantic correctness than method (9.4% vs 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness, but not performance. Across GEAK iterations, compile rate rises from 52.3% to 68.8% while average speedup declines from…
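The "explained deviance" comparison in the abstract can be illustrated with a small sketch. It assumes one binary correctness outcome per (task, method) pair and fits each factor on its own, using the per-group pass rate as the fitted probability (this is the maximum-likelihood fit of a logistic model with a single categorical predictor). The paper may well use a different model, for example one that includes both factors jointly. The record layout, the function names, and the toy data below are hypothetical and are not taken from the benchmark.

```python
import math
from collections import defaultdict


def bernoulli_deviance(outcomes, probs):
    """Return -2 * log-likelihood of 0/1 outcomes under the given probabilities."""
    eps = 1e-12  # clamp to avoid log(0) when a group is all-pass or all-fail
    total = 0.0
    for y, p in zip(outcomes, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * total


def explained_deviance(records, factor):
    """Fraction of null deviance removed by grouping on `factor`.

    records: list of dicts with a 0/1 'correct' field and the grouping field.
    """
    outcomes = [r["correct"] for r in records]
    overall_rate = sum(outcomes) / len(outcomes)
    null_dev = bernoulli_deviance(outcomes, [overall_rate] * len(outcomes))
    if null_dev == 0.0:
        return 0.0  # nothing to explain if every outcome is identical

    # Per-group pass rate: [number correct, number of records].
    counts = defaultdict(lambda: [0, 0])
    for r in records:
        counts[r[factor]][0] += r["correct"]
        counts[r[factor]][1] += 1

    fitted = [counts[r[factor]][0] / counts[r[factor]][1] for r in records]
    model_dev = bernoulli_deviance(outcomes, fitted)
    return 1.0 - model_dev / null_dev


# Hypothetical toy data: two categories, two methods, two tasks per cell.
toy_records = [
    {"category": "Math", "method": "A", "correct": 1},
    {"category": "Math", "method": "B", "correct": 1},
    {"category": "Math", "method": "A", "correct": 1},
    {"category": "Math", "method": "B", "correct": 0},
    {"category": "Fusion", "method": "A", "correct": 0},
    {"category": "Fusion", "method": "B", "correct": 0},
    {"category": "Fusion", "method": "A", "correct": 0},
    {"category": "Fusion", "method": "B", "correct": 1},
]

by_category = explained_deviance(toy_records, "category")
by_method = explained_deviance(toy_records, "method")
print(f"category: {by_category:.3f}, method: {by_method:.3f}")
```

In this toy data both methods have the same overall pass rate, so grouping by method explains none of the deviance, while grouping by category explains some of it. That mirrors the direction of the reported 9.4% vs 3.3% gap, not its magnitude.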
