TL;DR
PerfCodeBench is a new benchmark for evaluating large language models on high-performance, systems-level code optimization tasks, highlighting the gap between generated code and expert-optimized solutions.
Contribution
Introduces PerfCodeBench, an executable benchmark that assesses LLMs on systems-level code optimization using both correctness and efficiency metrics, and reports evaluation results across a broad set of state-of-the-art models.
Findings
A significant gap remains between LLM-generated code and expert-optimized implementations.
Models struggle with parallelism and GPU-related tasks.
Current models are weak in cross-language robustness and efficiency.
Abstract
Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness or algorithmic problem solving, while realistic systems-level optimization is still underexplored. To address this gap, we introduce PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. The tasks require system-level implementation choices, hardware-aware optimization, and careful handling of performance bottlenecks. Each task includes executable correctness checks, a baseline implementation, and a reference optimized solution. This allows us to evaluate both correctness and runtime-oriented efficiency. Our evaluation on a broad set of state-of-the-art LLMs shows a clear gap between model-generated code and…
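To make the task structure described in the abstract concrete, here is a minimal sketch of how a single task with a correctness check, a baseline, and a reference optimized solution might be scored. It is an illustration under stated assumptions, not the benchmark's actual harness: the `Task` record, `best_time`, `evaluate`, the two ratio metrics, and the toy sum-of-squares task are all hypothetical names and choices that do not come from the paper.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Task:
    """Hypothetical task record holding the three pieces the abstract lists."""
    baseline: Callable[[Any], Any]           # unoptimized implementation
    reference: Callable[[Any], Any]          # expert-optimized implementation
    is_correct: Callable[[Any, Any], bool]   # (task_input, output) -> bool


def best_time(fn, task_input, repeats=5):
    """Return the fastest of several wall-clock runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(task_input)
        best = min(best, time.perf_counter() - start)
    return max(best, 1e-12)  # guard against division by zero on coarse timers


def evaluate(task, candidate, task_input, repeats=5):
    """Check correctness first; time the candidate only if it passes."""
    try:
        ok = task.is_correct(task_input, candidate(task_input))
    except Exception:
        ok = False  # a crashing candidate counts as incorrect
    if not ok:
        return {"correct": False,
                "speedup_vs_baseline": None,
                "speed_relative_to_reference": None}
    t_base = best_time(task.baseline, task_input, repeats)
    t_ref = best_time(task.reference, task_input, repeats)
    t_cand = best_time(candidate, task_input, repeats)
    return {"correct": True,
            # > 1.0 means the candidate is faster than the baseline
            "speedup_vs_baseline": t_base / t_cand,
            # 1.0 means the candidate matches the reference; lower is slower
            "speed_relative_to_reference": t_ref / t_cand}


# Toy task (an assumption for illustration): sum of squares of 0..n-1.
def baseline_sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


def reference_sum_squares(n):
    return (n - 1) * n * (2 * n - 1) // 6  # closed form


def candidate_sum_squares(n):
    return sum(i * i for i in range(n))  # stands in for model-generated code


toy_task = Task(
    baseline=baseline_sum_squares,
    reference=reference_sum_squares,
    is_correct=lambda n, out: out == reference_sum_squares(n),
)

result = evaluate(toy_task, candidate_sum_squares, 10_000)
print(result)
```

Gating timing on correctness keeps fast but wrong code from scoring well, and reporting the candidate against both the baseline and the reference separates "better than naive" from "close to expert". A real harness would also need warm-up runs, controlled hardware, and statistical treatment of timing noise, none of which this sketch attempts.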
