Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks

Amir M. Ebrahimi; Gopi Krishnan Rajbahadur

arXiv:2604.05100 · cs.SE · April 8, 2026

TL;DR

This paper critically evaluates existing instructed code editing benchmarks, revealing their limitations in scope and coverage, and proposes guidelines for developing more representative benchmarks.

Contribution

It provides an empirical audit of CanItEdit and EDIT-Bench, highlighting gaps in language, domain, and test coverage, and offers grounded recommendations for future benchmark design.

Findings

01. Both benchmarks focus mainly on Python, neglecting TypeScript and other languages.

02. They have limited test coverage and miss key real-world editing activities like documentation and maintenance.

03. Many problems in the benchmarks are not solvable by current LLMs, partly due to poor benchmark design.

Abstract

Instructed code editing, where an LLM modifies existing code based on a natural language instruction, accounts for roughly 19% of real-world coding assistant interactions. Yet very few benchmarks directly evaluate this capability. From a survey of over 150 code-related benchmarks, we find that only two, CanItEdit and EDIT-Bench, target instructed code editing with human-authored instructions and test-based evaluation. We audit both by comparing their programming languages, edit intents, and application domains against distributions observed in the wild (Copilot Arena, AIDev, GitHub Octoverse), and by measuring test counts, statement coverage, and test scope across all 213 problems. Both benchmarks concentrate over 90% of evaluation on Python while TypeScript, GitHub's most-used language, is absent. Backend and frontend development, which together constitute 46% of real-world editing…
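The distribution comparison described in the abstract can be illustrated with a minimal sketch. It assumes each benchmark problem carries a single category label (for example, a programming language) and that a reference distribution is available as a mapping from category to share. The function names `category_shares` and `missing_categories` are hypothetical, and the data below is toy data chosen for illustration; it is not taken from the paper and does not reproduce its method or results.

```python
from collections import Counter


def category_shares(labels):
    """Return the fraction of items carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def missing_categories(benchmark_shares, reference_shares):
    """List reference categories the benchmark never covers,
    ordered by their share in the reference distribution (largest first)."""
    absent = [c for c in reference_shares if benchmark_shares.get(c, 0.0) == 0.0]
    return sorted(absent, key=lambda c: reference_shares[c], reverse=True)


# Toy data only (assumption): NOT the paper's numbers.
toy_benchmark_languages = ["python"] * 9 + ["javascript"]
toy_reference_shares = {"typescript": 0.4, "python": 0.4, "javascript": 0.2}

shares = category_shares(toy_benchmark_languages)
print(shares)
print(missing_categories(shares, toy_reference_shares))
```

The same two helpers would apply unchanged to edit intents or application domains, since they only depend on per-problem labels and a reference mapping.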
