ExecRepoBench: Multi-level Executable Code Completion Evaluation

Jian Yang; Jiajun Zhang; Jiaxi Yang; Ke Jin; Lei Zhang; Qiyao Peng; Ken Deng; Yibo Miao; Tianyu Liu; Zeyu Cui; Binyuan Hui; Junyang Lin

arXiv:2412.11990 · cs.CL · December 17, 2024

TL;DR

This paper introduces ExecRepoBench, a multi-level benchmark for evaluating code completion in complex, multi-file Python projects, and presents a fine-tuned open-source model that outperforms prior baselines.

Contribution

The paper presents a multi-level, repository-level benchmark, a grammar-based code completion methodology, and a fine-tuned open-source model, with the aim of improving code completion in real-world settings.

Findings

01. Qwen2.5-Coder-Instruct-C outperforms prior baselines across languages.

02. ExecRepoBench provides 1.2K real-world Python samples.

03. The framework enables more realistic code completion evaluation.

Abstract

Code completion has become an essential tool for daily software development. Existing evaluation benchmarks often employ static methods that do not fully capture the dynamic nature of real-world coding environments and face significant challenges, including limited context length, reliance on superficial evaluation metrics, and potential overfitting to training datasets. In this work, we introduce a novel framework for enhancing code completion in software development through the creation of the repository-level benchmark ExecRepoBench and the instruction corpora Repo-Instruct, aimed at improving the functionality of open-source large language models (LLMs) in real-world coding scenarios that involve complex interdependencies across multiple files. ExecRepoBench includes 1.2K samples from active Python repositories. In addition, we present a multi-level grammar-based completion methodology…
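The abstract is cut off before it describes the grammar-based methodology, so the following is only an illustrative sketch, not the authors' implementation. It assumes that "multi-level" means masking code at different syntactic granularities (here an expression, a statement, and a function) and uses Python's standard `ast` module to split a source string into a prefix, a masked middle, and a suffix. The names `mask_first_node`, `_char_offset`, and `SOURCE` are hypothetical.

```python
import ast
from typing import Optional, Tuple


def _char_offset(source: str, lineno: int, col: int) -> int:
    """Convert an ast position (1-based line, UTF-8 byte column) to a str index."""
    lines = source.splitlines(keepends=True)
    line_bytes = lines[lineno - 1].encode("utf-8")
    chars_before = sum(len(line) for line in lines[: lineno - 1])
    return chars_before + len(line_bytes[:col].decode("utf-8"))


def mask_first_node(source: str, node_type: type) -> Optional[Tuple[str, str, str]]:
    """Split source into (prefix, middle, suffix) around the first node of node_type.

    Hypothetical helper: the 'middle' is the span a completion model would
    be asked to fill in, given the prefix and suffix as context.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):  # breadth-first traversal
        if isinstance(node, node_type) and hasattr(node, "end_lineno"):
            start = _char_offset(source, node.lineno, node.col_offset)
            end = _char_offset(source, node.end_lineno, node.end_col_offset)
            return source[:start], source[start:end], source[end:]
    return None


SOURCE = (
    "def area(w, h):\n"
    "    result = w * h\n"
    "    return result\n"
)

# Three assumed masking levels: expression, statement, function.
expr_sample = mask_first_node(SOURCE, ast.BinOp)
stmt_sample = mask_first_node(SOURCE, ast.Assign)
func_sample = mask_first_node(SOURCE, ast.FunctionDef)
```

In this sketch, each sample's three parts concatenate back to the original source, so a model's completion for the middle span can be spliced in and the resulting file executed against tests.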

Code & Models

Datasets

CSJianYang/ExecRepoBench
dataset · 96 dl

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Reliability and Analysis Research
