How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark

Ruizhong Qiu; Weiliang Will Zeng; James Ezick; Christopher Lott; Hanghang Tong

arXiv:2406.06647 · cs.SE · February 20, 2025 · 2 cites

TL;DR

This paper introduces ENAMEL, a rigorous benchmark for evaluating the efficiency of code generated by large language models, highlighting current models' shortcomings in producing expert-level efficient code.

Contribution

The paper proposes a new efficiency metric eff@k, develops an unbiased estimator, and establishes a high-standard benchmark with expert-designed reference solutions for evaluating LLM-generated code efficiency.

Findings

1. LLMs still underperform in generating efficient, expert-level code.

2. Current LLMs struggle with advanced algorithm design and optimization.

3. ENAMEL provides a rigorous framework for efficiency evaluation.

Abstract

The emergence of large language models (LLMs) has significantly pushed the frontiers of program synthesis. Advancement of LLM-based program synthesis calls for a thorough evaluation of LLM-generated code. Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, as an important measure of code quality, has been overlooked in existing evaluations. In this work, we develop ENAMEL (EfficeNcy AutoMatic EvaLuator), a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code. Firstly, we propose a new efficiency metric called eff@k, which generalizes the pass@k metric from correctness to efficiency and appropriately handles right-censored execution time. Furthermore, we derive an unbiased and variance-reduced estimator of eff@k via Rao-Blackwellization; we also provide a numerically stable implementation for…
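The relationship between pass@k and an efficiency-weighted generalization can be illustrated with a minimal Python sketch. It is not the ENAMEL package's API: the function names `pass_at_k` and `expected_max_at_k` are hypothetical, and the sketch assumes each of the n generated samples has already been reduced to an efficiency score in [0, 1], with 0 for incorrect or timed-out code (the paper's handling of right-censored execution time is not modeled here). Under those assumptions, the expected best score among k samples drawn without replacement from the n available has a closed form over the sorted scores, and with 0/1 scores it reduces to the familiar combinatorial pass@k estimator.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Combinatorial pass@k estimate from n samples, c of which are correct."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    # Probability that a random k-subset contains at least one correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def expected_max_at_k(scores: list[float], k: int) -> float:
    """Expected maximum score over a uniformly random k-subset of `scores`.

    Hypothetical helper: `scores` are assumed to be per-sample efficiency
    scores in [0, 1], with 0 for incorrect or timed-out samples.
    """
    n = len(scores)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    ordered = sorted(scores)  # ascending: ordered[r - 1] is the r-th smallest
    # The r-th smallest score is the subset maximum in comb(r - 1, k - 1)
    # of the comb(n, k) possible subsets (zero whenever r < k).
    weighted = sum(comb(r - 1, k - 1) * e for r, e in enumerate(ordered, start=1))
    return weighted / comb(n, k)


if __name__ == "__main__":
    binary_scores = [0.0, 0.0, 1.0, 1.0]
    print(pass_at_k(n=4, c=2, k=2))
    print(expected_max_at_k(binary_scores, k=2))
    print(expected_max_at_k([0.2, 0.5, 0.9], k=2))
```

Averaging over all k-subsets of the n samples, rather than scoring a single random subset, is what makes this kind of estimator lower-variance; for large n the binomial coefficients grow quickly, which is presumably why a numerically stable implementation matters.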

Code & Models

Repositories

q-rz/enamel
Official

Datasets

q-rz/enamel
dataset · 116 downloads

Videos

How efficient is LLM-generated code? A rigorous & high-standard benchmark · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Focus · Attentive Walk-Aggregating Graph Neural Network