KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

Han Wang; Jintao Zhang; Kai Jiang; Haoxu Wang; Jianfei Chen; Jun Zhu

arXiv:2605.04956 · cs.LG · May 12, 2026

TL;DR

KernelBenchX is a benchmark that evaluates the correctness and hardware efficiency of LLM-generated Triton GPU kernels across 176 tasks in 15 categories, showing that task category affects correctness more than the choice of generation method.

Contribution

The paper introduces KernelBenchX, a benchmark for category-aware evaluation of LLM-generated GPU kernels, and uses it to show how task structure and iterative refinement affect correctness and speedup.

Findings

01 Task structure influences correctness more than method design.

02 Iterative refinement improves correctness but reduces speedup.

03 Correctness does not guarantee efficiency; quantization remains unsolved.

Abstract

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBenchX, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design. Category explains nearly three times more variance in semantic correctness than method (9.4% vs 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness, but not performance. Across GEAK iterations, compile rate rises from 52.3% to 68.8% while average speedup declines from…
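The category-aware metrics named in the abstract (compile rate, correctness, average speedup) can be illustrated with a minimal sketch. This is not the benchmark's code: the `KernelResult` record, the `summarize_by_category` helper, the method names, and the toy values are all hypothetical, and the choice to average speedup over correct kernels only is an assumption, not something the abstract states.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class KernelResult:
    """One generated kernel's outcome (hypothetical record layout)."""
    category: str             # task category, e.g. "Math" or "Fusion"
    method: str               # generation method (placeholder names below)
    compiled: bool            # did the kernel compile?
    correct: bool             # did it match the reference output?
    speedup: Optional[float]  # reference time / kernel time; None if not measured


def summarize_by_category(results):
    """Aggregate compile rate, correctness rate, and mean speedup per category."""
    groups = defaultdict(list)
    for r in results:
        groups[r.category].append(r)

    summary = {}
    for category, rows in groups.items():
        n = len(rows)
        # Assumption: speedup is averaged over correct kernels only.
        speedups = [r.speedup for r in rows if r.correct and r.speedup is not None]
        summary[category] = {
            "compile_rate": sum(r.compiled for r in rows) / n,
            "correct_rate": sum(r.correct for r in rows) / n,
            "mean_speedup": sum(speedups) / len(speedups) if speedups else None,
        }
    return summary


# Toy data for illustration only; these are not results from the paper.
results = [
    KernelResult("Math", "method_a", True, True, 1.5),
    KernelResult("Math", "method_b", True, True, 0.5),
    KernelResult("Fusion", "method_a", True, False, None),
    KernelResult("Fusion", "method_b", False, False, None),
]

summary = summarize_by_category(results)
for category, stats in summary.items():
    print(category, stats)
```

Grouping by category before averaging is what makes a pattern like "Fusion fails, Math succeeds" visible; a single pooled average over all tasks would hide it.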

Code & Models

Repositories

BonnieW05/KernelBenchX (GitHub)
