CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
Shiyang Li, Haoyang Chen, Mattia Fazzini, Caiwen Ding

TL;DR
CUDABeaver introduces a benchmark and evaluation protocol for assessing the effectiveness of LLM-based CUDA debugging tools, emphasizing the importance of performance preservation and realistic failure scenarios.
Contribution
The paper presents CUDABeaver, a new benchmark with a protocol-conditional metric for more accurate evaluation of LLM-based CUDA debugging methods.
Findings
Protocol-aware evaluation reveals that the measured success of CUDA fixers is highly sensitive to the performance requirement imposed on the repaired code.
When a large performance loss is tolerated, fixers show much higher success rates.
Stricter performance requirements sharply reduce the measured success of debugging tools.
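The findings above can be illustrated with a small sketch of a threshold-conditional success rate. This summary does not give the paper's actual metric definition, so everything here is an assumption made for illustration: the `RepairResult` fields, the `max_slowdown` parameter, and the sample numbers are all hypothetical. The idea is only that a repair counts as a success when it passes the tests and stays within an allowed slowdown relative to a reference runtime.

```python
from dataclasses import dataclass


@dataclass
class RepairResult:
    """Outcome of one repair attempt (hypothetical fields, not from the paper)."""
    passes_tests: bool    # repaired code builds and passes the task's tests
    reference_ms: float   # runtime of an assumed reference implementation
    repaired_ms: float    # runtime of the repaired candidate


def success_rate(results, max_slowdown):
    """Fraction of tasks whose repair passes the tests and runs no slower
    than max_slowdown times the reference runtime."""
    if not results:
        return 0.0
    ok = sum(
        1
        for r in results
        if r.passes_tests and r.repaired_ms <= max_slowdown * r.reference_ms
    )
    return ok / len(results)


# Made-up numbers, for illustration only.
results = [
    RepairResult(True, 1.0, 1.05),   # fixed, performance preserved
    RepairResult(True, 1.0, 4.0),    # fixed, but 4x slower (degenerate repair)
    RepairResult(True, 2.0, 2.1),    # fixed, performance preserved
    RepairResult(False, 1.0, 1.0),   # still failing
]

lenient = success_rate(results, max_slowdown=10.0)  # tolerates large slowdowns
strict = success_rate(results, max_slowdown=1.1)    # requires near-reference speed
print(f"lenient: {lenient:.2f}, strict: {strict:.2f}")
```

With these made-up inputs the same set of repairs scores 0.75 under the lenient threshold and 0.50 under the strict one, which is the kind of gap the findings describe: a correctness-only view counts the 4x slower repair as a success, while a performance-aware view does not.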
Abstract
Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER…
