Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion?
Zhenyu Pan, Rongyu Cao, Yongchang Cao, Yingwei Ma, Binhua Li, Fei Huang, Han Liu, Yongbin Li

TL;DR
Codev-Bench is a new evaluation framework for code completion tools that uses real-world, repository-level data and dynamic testing to better reflect developer needs and improve comparison fairness.
Contribution
The paper introduces Codev-Bench, a novel, developer-centric benchmark that leverages an agent-based system for realistic, repository-level evaluation of code completion tools.
Findings
Codev-Bench enables more accurate assessment of code completion tools in real-world scenarios.
The agent-based system improves the fairness and relevance of benchmark comparisons.
Results show better alignment with developer intent than existing benchmarks.
Abstract
Code completion, a key downstream task in code generation, is one of the most frequent and impactful methods for enhancing developer productivity in software development. As intelligent completion tools evolve, we need a robust evaluation benchmark that enables meaningful comparisons between products and guides future advancements. However, existing benchmarks focus more on coarse-grained tasks without industrial analysis, resembling general code generation rather than the real-world scenarios developers encounter. Moreover, these benchmarks often rely on costly and time-consuming human annotation, and the standalone test cases fail to leverage minimal tests for maximum repository-level understanding and code coverage. To address these limitations, we first analyze business data from an industrial code completion tool and redefine the evaluation criteria to better align with the…
Peer Reviews
Decision·Submitted to ICLR 2025
- Nice to see a benchmark inspired by actual use
- Neat trick to generate test cases to assess code completion quality
The 'product business data analysis' description lacks detail. How many data points were collected? From what type of people? Using which language models? Was this a standard product like GitHub Copilot or something else? How was line completion integrated in the IDE? There are also several existing publications reporting on industrial or practical usage of line completion, using different categories. See, e.g., https://arxiv.org/abs/2402.16197. Also for Codev-Bench, crucial details are missi
- *Data-driven design*: The benchmark is grounded in insights from an industrial code completion tool, allowing Codev-Bench to closely align with real-world developer workflows and capture diverse usage scenarios. This industry-driven design enhances the benchmark's practical significance.
- Moves beyond traditional function-level tasks by introducing scenario-based evaluation with granular tasks.
- Extensive experimental results across four different completion scenarios, showcasing the strengt
- *Limited discussion/details on the evaluation of completed code at different levels of granularity*: For in-line or single-line code completion, testing against all unit tests might be excessive. Since these completions are smaller and more context-specific, they often don’t impact the broader program behavior significantly.
- *Benchmark adaptability*: The focus on industrial-level repositories and high-quality data is valuable, yet the paper could discuss more on the adaptability of Codev-Be
- The authors proposed CodevAgent as a framework to automatically generate new test samples and evaluate LLMs without the need for extensive manual annotations. This can be crucial for preventing data leakage and for effectively scaling the benchmark.
- The idea of collecting and analyzing true needs of software developers could be quite impactful, given the relevance to downstream applications.
- The authors conducted a comprehensive evaluation of multiple General LLMs and Code LLMs.
- The eval
- The details regarding the construction of the benchmark with CodevAgent are somewhat unclear. For instance, while CodevAgent appears to utilize unit tests from the raw repositories for evaluation, the authors do not clearly explain how to ensure the quality of these test cases or the selection of benchmark repositories, given that raw repository tests might not be reliable. It is also unclear the selection rate, data quality, and time investment in the transition from raw repositories to well-
Taxonomy
Topics: Natural Language Processing Techniques · Artificial Intelligence in Law
Methods: Focus · ALIGN
