EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM-Driven Coding Systems

Wentao Zhang; Jianfeng Wang; Liheng Liang; Yilei Zhao; HaiBin Wen; Zhe Zhao

arXiv:2602.10171 · cs.SE · February 12, 2026

TL;DR

EvoCodeBench is a new benchmark for evaluating self-evolving coding systems driven by large language models (LLMs). It measures their iterative improvement, efficiency, and cross-language robustness, and compares their performance with that of human programmers.

Contribution

The paper introduces a comprehensive benchmark that captures inference-time self-evolution, resource costs, and cross-language stability, and compares results directly against human performance.

Findings

1. Self-evolving systems show measurable efficiency gains over time.

2. The benchmark enables analysis of cross-language robustness and long-tail language stability.

3. Human performance provides a valuable reference point for evaluating LLM coding systems.

Abstract

As large language models (LLMs) continue to advance in programming tasks, LLM-driven coding systems have evolved from one-shot code generation into complex systems capable of iterative improvement during inference. However, existing code benchmarks primarily emphasize static correctness and implicitly assume fixed model capability during inference. As a result, they do not capture inference-time self-evolution, such as whether accuracy and efficiency improve as an agent iteratively refines its solutions. They also provide limited accounting of resource costs and rarely calibrate model performance against that of human programmers. Moreover, many benchmarks are dominated by high-resource languages, leaving cross-language robustness and long-tail language stability underexplored. Therefore, we present EvoCodeBench, a benchmark for evaluating self-evolving LLM-driven coding systems across…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Materials Science