HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation

Dewu Zheng; Yanlin Wang; Ensheng Shi; Ruikai Zhang; Yuchi Ma; Hongyu Zhang; Zibin Zheng

arXiv:2406.06918 · cs.SE · March 19, 2025 · 2 cites

TL;DR

This paper introduces HumanEvo, an evolution-aware benchmark for evaluating repository-level code generation by LLMs, addressing the limitations of previous methods that ignore software evolution over time.

Contribution

The paper constructs an evolution-aware dataset, categorizes it by dependency level, and evaluates seven LLMs, showing that previous evaluation methods overestimate their performance.

Findings

01. Evolution-ignored evaluation methods overestimate LLM performance by 10-61%.

02. Evolution-aware evaluation gives a more accurate picture of LLM performance.

03. The benchmark and toolbox support future research on realistic evaluation of code generation.

Abstract

To evaluate the repository-level code generation capabilities of Large Language Models (LLMs) in complex real-world software development scenarios, many evaluation methods have been developed. These methods typically leverage contextual code from the latest version of a project to assist LLMs in accurately generating the desired function. However, such evaluation methods fail to consider the dynamic evolution of software projects over time, which we refer to as evolution-ignored settings. This in turn results in inaccurate evaluation of LLMs' performance. In this paper, we conduct an empirical study to deeply understand LLMs' code generation performance within settings that reflect the evolution nature of software development. To achieve this, we first construct an evolution-aware repository-level code generation dataset, namely HumanEvo, equipped with an automated execution-based…
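The difference between the two settings described in the abstract can be illustrated with a minimal sketch. This is not HumanEvo's actual pipeline: the `Snapshot` type, the two helper functions, and the toy commit history are all hypothetical, and the sketch assumes that an evolution-aware setting restricts context to the repository state that existed before the target function was written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical repository state: a commit order index and the symbols it defines."""
    order: int
    symbols: frozenset


def evolution_ignored_context(history):
    # Evolution-ignored setting: take context from the latest snapshot,
    # regardless of when the target function was introduced.
    return max(history, key=lambda s: s.order).symbols


def evolution_aware_context(history, target_order):
    # Assumed evolution-aware setting: only the most recent snapshot
    # strictly before the target commit is visible.
    earlier = [s for s in history if s.order < target_order]
    if not earlier:
        return frozenset()
    return max(earlier, key=lambda s: s.order).symbols


# Toy history (hypothetical symbol names).
history = [
    Snapshot(0, frozenset({"load_config"})),
    Snapshot(1, frozenset({"load_config", "parse_args"})),
    Snapshot(2, frozenset({"load_config", "parse_args", "run_pipeline"})),
]
target_order = 2  # the target function is introduced in commit 2

# Symbols visible only because the latest snapshot was used.
leaked = evolution_ignored_context(history) - evolution_aware_context(history, target_order)
print(sorted(leaked))
```

In this toy history, the evolution-ignored context exposes `run_pipeline`, which did not exist when the target function was written. Context of this kind is what can make a model look stronger than it would be in a setting that respects project history.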

Taxonomy

Topics: Digital Rights Management and Security