EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

Yuhao Qing; Boyu Zhu; Mingzhe Du; Zhijiang Guo; Terry Yue Zhuo; Qianru Zhang; Jie M. Zhang; Heming Cui; Siu-Ming Yiu; Dong Huang; See-Kiong Ng; Luu Anh Tuan

arXiv:2505.13004 · cs.CL · May 20, 2025

TL;DR

EffiBench-X is a multi-language benchmark that evaluates the efficiency of LLM-generated code across six programming languages, revealing current models lag behind human efficiency and highlighting the need for optimization techniques.

Contribution

Introduces EffiBench-X, the first multi-language benchmark for measuring the efficiency of LLM-generated code, with a dataset and evaluation framework covering six languages.

Findings

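01. LLMs generate functionally correct code, but it is consistently less efficient than human-expert solutions.

02. Qwen3-32B, which produces the most efficient solutions among the evaluated models, reaches only about 62% of human efficiency on average.

03. Efficiency varies significantly across programming languages.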
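A figure such as "62% of human efficiency" can be read as an average ratio between the human-expert baseline and the model's solution. The sketch below illustrates that arithmetic only. It assumes efficiency is scored as baseline runtime divided by model runtime, capped at 100%; the exact metric definition is not given in this summary, and the function names and timings are hypothetical.

```python
from statistics import mean


def relative_efficiency(human_ms: float, model_ms: float) -> float:
    """Ratio of baseline runtime to model runtime, capped at 1.0.

    Assumption for illustration: a model solution that runs twice as
    long as the human baseline scores 0.5. This is not necessarily the
    metric used by the benchmark itself.
    """
    if human_ms <= 0 or model_ms <= 0:
        raise ValueError("runtimes must be positive")
    return min(human_ms / model_ms, 1.0)


def average_efficiency(timings: list[tuple[float, float]]) -> float:
    """Mean relative efficiency over (human_ms, model_ms) pairs."""
    return mean(relative_efficiency(h, m) for h, m in timings)


# Hypothetical per-task runtimes in milliseconds: (human, model).
timings = [(100, 200), (300, 400), (500, 500)]
score = average_efficiency(timings)
print(f"{score:.0%}")  # prints "75%"
```

Under this reading, a per-language breakdown would simply group the pairs by language before averaging, which is one way the language-specific variation in the findings could surface.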

Abstract

Existing code generation benchmarks primarily evaluate functional correctness, with limited focus on code efficiency and often restricted to a single language like Python. To address this gap, we introduce EffiBench-X, the first multi-language benchmark designed to measure the efficiency of LLM-generated code. EffiBench-X supports Python, C++, Java, JavaScript, Ruby, and Golang. It comprises competitive programming tasks with human-expert solutions as efficiency baselines. Evaluating state-of-the-art LLMs on EffiBench-X reveals that while models generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM-generated solutions (Qwen3-32B) achieve only around 62% of human efficiency on average, with significant language-specific variations. LLMs show better efficiency in Python, Ruby, and JavaScript than in Java, C++,…

Code & Models

Repositories

effibench/effibench-x
Official

Datasets

EffiBench/effibench-x
dataset · 184 downloads

Taxonomy

Topics: Software Engineering Research · Logic, programming, and type systems · Software Testing and Debugging Techniques
