CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs
Shiyang Li, Zijian Zhang, Guangyan Sun, Yuebo Luo, Winson Chen, Yanzhi Wang, Mingyi Hong, Caiwen Ding

TL;DR
CUDAHercules is a comprehensive benchmark that assesses the ability of AI models to generate expert-level CUDA code for large language models, revealing significant gaps in current AI capabilities.
Contribution
Introduces CUDAHercules, a benchmark for evaluating AI-generated CUDA code against expert standards across multiple hardware architectures and tasks.
Findings
Model-generated CUDA often compiles and passes tests but rarely recovers the optimization strategies that experts use.
Application-level semantic requirements further reduce success rates in CUDA code generation.
Iterative or tool-augmented feedback can improve correctness but tends to lead to slower fallback solutions.
Abstract
Large language models show promise for automated CUDA programming; however, even the strongest coding models (e.g., Claude-Opus-4.6) may still fall short of expert-level, architecture-aware optimization. We introduce CUDAHercules, a benchmark that evaluates generated CUDA against end-to-end human-expert SOTA systems. It spans single kernels, module-level operators, full applications, and unsolved challenge tasks across Ampere, Hopper, and Blackwell GPUs, with end-to-end tasks gated by domain-specific semantic validators. Evaluating models such as Claude-Opus-4.6 and GPT-5.4 shows a large gap between runnable CUDA and expert CUDA engineering: models often compile and pass tests, but rarely recover the optimization strategies needed to match expert performance. Application semantics further reduce success, and iterative or tool-augmented feedback can improve correctness while drifting…
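The distinction the abstract draws between runnable CUDA and expert CUDA can be made concrete with a small scoring sketch. This is an illustration only, not the benchmark's actual metric or code: the `TaskResult` fields, the `relative_performance` ratio, and the `threshold` parameter are all hypothetical names chosen for this example. It assumes a candidate is first gated on compiling and passing a validator, and only then compared against an expert baseline runtime.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record for one generated solution (not the paper's schema)."""
    compiled: bool
    passed_validator: bool
    runtime_ms: float          # measured runtime of the generated CUDA
    expert_runtime_ms: float   # runtime of the human-expert baseline


def relative_performance(result: TaskResult) -> float:
    """Expert time divided by generated time; 0.0 if the candidate is gated out."""
    if not (result.compiled and result.passed_validator):
        return 0.0
    return result.expert_runtime_ms / result.runtime_ms


def summarize(results: list[TaskResult], threshold: float = 1.0) -> dict[str, float]:
    """Report how many candidates are merely correct versus expert-level."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    correct = sum(1 for r in results if r.compiled and r.passed_validator)
    expert = sum(1 for r in results if relative_performance(r) >= threshold)
    return {"correct_rate": correct / n, "expert_rate": expert / n}


if __name__ == "__main__":
    demo = [
        TaskResult(True, True, 2.0, 1.0),    # correct, but 2x slower than expert
        TaskResult(True, True, 1.0, 1.0),    # correct and matches expert
        TaskResult(True, False, 0.5, 1.0),   # fast, but fails the validator
        TaskResult(False, False, 0.0, 1.0),  # does not compile
    ]
    print(summarize(demo))
```

Under these assumptions, a fast candidate that fails validation scores zero, which is why the correct rate and the expert rate can diverge sharply.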
