HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation
Dewu Zheng, Yanlin Wang, Ensheng Shi, Ruikai Zhang, Yuchi Ma, Hongyu Zhang, Zibin Zheng

TL;DR
This paper introduces HumanEvo, an evolution-aware benchmark for evaluating repository-level code generation by LLMs, addressing the limitations of previous methods that ignore software evolution over time.
Contribution
The paper constructs an evolution-aware dataset, categorizes its tasks by dependency level, and evaluates seven LLMs, showing that previous evaluation methods overestimate their performance.
Findings
Previous methods overestimate LLM performance by 10-61%.
Evolution-aware evaluation provides more accurate performance insights.
The benchmark and toolbox facilitate future research in realistic code generation evaluation.
Abstract
To evaluate the repository-level code generation capabilities of Large Language Models (LLMs) in complex real-world software development scenarios, many evaluation methods have been developed. These methods typically leverage contextual code from the latest version of a project to assist LLMs in accurately generating the desired function. However, such evaluation methods fail to consider the dynamic evolution of software projects over time, which we refer to as evolution-ignored settings. This in turn results in inaccurate evaluation of LLMs' performance. In this paper, we conduct an empirical study to deeply understand LLMs' code generation performance within settings that reflect the evolving nature of software development. To achieve this, we first construct an evolution-aware repository-level code generation dataset, namely HumanEvo, equipped with an automated execution-based…
