How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark
Ruizhong Qiu, Weiliang Will Zeng, James Ezick, Christopher Lott, Hanghang Tong

TL;DR
This paper introduces ENAMEL, a rigorous benchmark for evaluating the efficiency of code generated by large language models, highlighting current models' shortcomings in producing expert-level efficient code.
Contribution
The paper proposes a new efficiency metric eff@k, develops an unbiased estimator, and establishes a high-standard benchmark with expert-designed reference solutions for evaluating LLM-generated code efficiency.
Findings
LLMs still underperform in generating efficient, expert-level code.
Current LLMs struggle with advanced algorithm design and optimization.
ENAMEL provides a rigorous framework for efficiency evaluation.
Abstract
The emergence of large language models (LLMs) has significantly pushed the frontiers of program synthesis. Advancement of LLM-based program synthesis calls for a thorough evaluation of LLM-generated code. Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, as an important measure of code quality, has been overlooked in existing evaluations. In this work, we develop ENAMEL (EfficeNcy AutoMatic EvaLuator), a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code. Firstly, we propose a new efficiency metric called eff@k, which generalizes the pass@k metric from correctness to efficiency and appropriately handles right-censored execution time. Furthermore, we derive an unbiased and variance-reduced estimator of eff@k via Rao-Blackwellization; we also provide a numerically stable implementation for…
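To make the relationship between pass@k and eff@k concrete, here is a minimal Python sketch. It assumes eff@k can be read as the expected best efficiency score among k samples, that each of the n generated samples has already been assigned a score in [0, 1], and that incorrect or timed-out samples score 0. The function names `pass_at_k` and `expected_max_at_k` are hypothetical and are not taken from the paper or its code; the second function averages the best score over all size-k subsets of the n samples, which is one standard way to obtain an unbiased estimate from a fixed pool and may differ in detail from the authors' implementation.

```python
def pass_at_k(n, c, k):
    """Standard unbiased pass@k estimate: n samples, c of them correct."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    # 1 - C(n-c, k) / C(n, k), written as a running product so that
    # no large binomial coefficient is ever formed.
    prob_all_fail = 1.0
    for i in range(k):
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail


def expected_max_at_k(scores, k):
    """Average of the best score over all size-k subsets of `scores`.

    Assumption (not from the source): scores are per-sample efficiency
    scores in [0, 1], with 0 for incorrect or timed-out samples.
    """
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(scores)")
    ordered = sorted(scores)
    # The r-th smallest score (1-indexed) is the subset maximum in
    # C(r-1, k-1) of the C(n, k) subsets. Start from the largest score,
    # whose weight is C(n-1, k-1) / C(n, k) = k / n, and walk downward
    # using w_{r-1} = w_r * (r - k) / (r - 1), so weights only shrink.
    weight = k / n
    total = 0.0
    for r in range(n, k - 1, -1):
        total += weight * ordered[r - 1]
        if r > k:
            weight *= (r - k) / (r - 1)
    return total
```

With 0/1 scores the second function reduces to the first, which is the sense in which an efficiency-weighted metric generalizes pass@k: `expected_max_at_k([1, 1, 0, 0, 0], 2)` and `pass_at_k(5, 2, 2)` agree up to floating-point rounding.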
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · Focus · Attentive Walk-Aggregating Graph Neural Network
