Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks
Amir M. Ebrahimi, Gopi Krishnan Rajbahadur

TL;DR
This paper audits existing instructed code-editing benchmarks, showing where their scope and coverage fall short, and proposes guidelines for developing more representative benchmarks.
Contribution
It provides an empirical audit of CanItEdit and EDIT-Bench, highlighting gaps in language, domain, and test coverage, and offers grounded recommendations for future benchmark design.
Findings
Both benchmarks concentrate over 90% of evaluation on Python, while TypeScript, GitHub's most-used language, is absent.
They have limited test coverage and miss key real-world editing activities like documentation and maintenance.
Many problems in the benchmarks are not solvable by current LLMs, partly due to poor benchmark design.
Abstract
Instructed code editing, where an LLM modifies existing code based on a natural language instruction, accounts for roughly 19% of real-world coding assistant interactions. Yet very few benchmarks directly evaluate this capability. From a survey of over 150 code-related benchmarks, we find that only two, CanItEdit and EDIT-Bench, target instructed code editing with human-authored instructions and test-based evaluation. We audit both by comparing their programming languages, edit intents, and application domains against distributions observed in the wild (Copilot Arena, AIDev, GitHub Octoverse), and by measuring test counts, statement coverage, and test scope across all 213 problems. Both benchmarks concentrate over 90% of evaluation on Python while TypeScript, GitHub's most-used language, is absent. Backend and frontend development, which together constitute 46% of real-world editing…
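The distribution comparison described in the abstract could be sketched roughly as follows. This is only an illustrative sketch, not the authors' tooling: the function names `language_shares` and `coverage_gaps` are hypothetical, and the toy numbers are invented for the example and are not taken from the paper or from any of the cited datasets.

```python
from collections import Counter


def language_shares(languages):
    """Return each language's share of the problems as a fraction in [0, 1]."""
    counts = Counter(languages)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def coverage_gaps(benchmark_shares, reference_shares):
    """Reference share minus benchmark share, for each reference language.

    A positive value means the benchmark under-represents that language
    relative to the reference distribution.
    """
    return {
        lang: ref - benchmark_shares.get(lang, 0.0)
        for lang, ref in reference_shares.items()
    }


# Toy data (hypothetical, NOT the paper's figures): one language label per problem.
toy_benchmark = ["python"] * 9 + ["javascript"]
# Toy reference distribution (hypothetical) standing in for in-the-wild usage.
toy_reference = {"python": 0.5, "typescript": 0.3, "javascript": 0.2}

shares = language_shares(toy_benchmark)
gaps = coverage_gaps(shares, toy_reference)
# Languages present in the reference distribution but missing from the benchmark.
absent = sorted(lang for lang in toy_reference if lang not in shares)

print(shares)
print(gaps)
print(absent)
```

On the toy input, the only language in `absent` is the one that appears in the reference distribution but in none of the benchmark problems, which is the kind of gap the audit reports for TypeScript.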
