EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

Wayne Chi; Valerie Chen; Ryan Shar; Aditya Mittal; Jenny Liang; Wei-Lin Chiang; Anastasios Nikolas Angelopoulos; Ion Stoica; Graham Neubig; Ameet Talwalkar; Chris Donahue

arXiv:2511.04486 · cs.SE · November 18, 2025



Open Access

TL;DR

This paper introduces EDIT-Bench, a comprehensive benchmark for evaluating large language models' ability to perform real-world instructed code edits, emphasizing context understanding and diverse use cases.

Contribution

The paper presents a new benchmark grounded in real-world data, covering multiple languages and use cases, to evaluate LLMs' code editing capabilities more realistically.

Findings

01. Only 1 model scores over 60% on the benchmark

02. Model performance varies significantly across instruction categories

03. Contextual information greatly impacts task success rate

Abstract

Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability, and current datasets often rely on artificial sources. We introduce EDIT-Bench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. EDIT-Bench comprises 540 problems, multiple natural and programming languages, and a diverse set of real-world use cases, ranging from resolving errors to adding features. EDIT-Bench introduces context-dependent problems that require the model to understand code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and observe that EDIT-Bench is a challenging set of…
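To make the inputs named in the abstract concrete (user instruction, code context, highlighted code, and cursor position), here is a minimal Python sketch of how one such problem might be represented and flattened into a prompt. The class name `EditProblem`, its field names, the character-offset convention, and the `build_prompt` helper are all illustrative assumptions; none of them is taken from the EDIT-Bench dataset or its evaluation harness.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EditProblem:
    """Hypothetical record for one instructed-edit problem.

    Field names and offset conventions are assumptions made for this
    sketch, not the benchmark's actual schema.
    """
    instruction: str                             # the user's edit request
    code: str                                    # full contents of the file being edited
    highlight: Optional[Tuple[int, int]] = None  # [start, end) character offsets
    cursor: Optional[int] = None                 # character offset of the cursor


def build_prompt(problem: EditProblem) -> str:
    """Flatten the instruction and whatever context is present into one prompt."""
    parts = [f"Instruction: {problem.instruction}", "Code:", problem.code]
    if problem.highlight is not None:
        start, end = problem.highlight
        parts.append(f"Highlighted: {problem.code[start:end]}")
    if problem.cursor is not None:
        # Convert the character offset into a 1-based line number.
        line = problem.code.count("\n", 0, problem.cursor) + 1
        parts.append(f"Cursor line: {line}")
    return "\n".join(parts)


if __name__ == "__main__":
    example = EditProblem(
        instruction="rename x to total",
        code="x = 1\nprint(x)\n",
        highlight=(0, 5),
        cursor=8,
    )
    print(build_prompt(example))
```

Keeping the highlight and cursor optional reflects the abstract's distinction between problems that depend on this extra context and those driven by the instruction alone.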


Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Teaching and Learning Programming