Excision Score: Evaluating Edits with Surgical Precision
Nikolai Gruzinov, Ksenia Sycheva, Earl T. Barr, Alex Bezzubov

TL;DR
The paper introduces Excision Score, a novel static measure for evaluating document revisions that focuses on divergent regions by removing shared content, aligning better with human judgment than existing metrics.
Contribution
It proposes the Excision Score, a new similarity measure that isolates divergent content for more accurate revision evaluation, addressing flaws in existing metrics like BLEU.
Findings
Excision Score outperforms existing measures like SARI and BLEU in code editing evaluation.
ES shows higher correlation with human judgment, especially with increased shared context.
ES correctly handles code block movements and matching insertions/deletions.
Abstract
Many tasks revolve around editing a document, whether code or text. We formulate the revision similarity problem to unify a wide range of machine learning evaluation problems whose goal is to assess a revision to an existing document. We observe that revisions usually change only a small portion of an existing document, so the existing document and its immediate revisions share a majority of their content. We formulate five adequacy criteria for revision similarity measures, designed to align them with human judgement. We show that popular pairwise measures, like BLEU, fail to meet these criteria, because their scores are dominated by the shared content. They report high similarity between two revisions when humans would assess them as quite different. This is a fundamental flaw we address. We propose a novel static measure, Excision Score (ES), which computes longest common subsequence…
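The idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes whitespace tokenization, a pairwise token-level LCS between the original and each revision, and a plain unigram F1 as a stand-in for the final comparison (the reviews indicate that ES itself uses SARI at that step). The helper names `lcs_table`, `excise`, `unigram_f1`, and `excised_similarity` are hypothetical.

```python
from collections import Counter


def lcs_table(a, b):
    """dp[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    return dp


def excise(original, revision):
    """Drop the LCS shared by original and revision.

    Returns (deleted, inserted): tokens only in the original and tokens
    only in the revision, i.e. the divergent content.
    """
    dp = lcs_table(original, revision)
    i = j = 0
    deleted, inserted = [], []
    while i < len(original) and j < len(revision):
        if original[i] == revision[j]:
            i += 1  # shared token: excised from both sides
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            deleted.append(original[i])
            i += 1
        else:
            inserted.append(revision[j])
            j += 1
    deleted.extend(original[i:])
    inserted.extend(revision[j:])
    return deleted, inserted


def unigram_f1(a, b):
    """Bag-of-tokens F1. An assumed stand-in, not SARI or BLEU."""
    if not a and not b:
        return 1.0
    overlap = sum((Counter(a) & Counter(b)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def excised_similarity(original, reference, hypothesis):
    """Compare two revisions only on their divergence from the original."""
    def tagged(revision):
        deleted, inserted = excise(original, revision)
        return [("-", t) for t in deleted] + [("+", t) for t in inserted]
    return unigram_f1(tagged(reference), tagged(hypothesis))


# Toy data (made up for illustration): both revisions change one operator.
original = "def add ( a , b ) : return a - b".split()
reference = "def add ( a , b ) : return a + b".split()
hypothesis = "def add ( a , b ) : return a * b".split()

whole_document = unigram_f1(reference, hypothesis)
excised = excised_similarity(original, reference, hypothesis)
print(f"whole-document: {whole_document:.3f}, excised: {excised:.3f}")
```

On this toy input the whole-document score is about 0.92, because 11 of the 12 tokens are shared context, while the excised score is 0.5: both revisions delete the same operator but insert different ones. This is the effect the abstract describes, where shared content dominates a pairwise score.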
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
In experiments on code-editing datasets, the paper demonstrates that Excision Score correlates significantly better with actual test execution results (pass/fail) than existing metrics, especially when a large amount of shared context is present.
While the work is valuable, the contribution may lack the substantiality and novelty expected for a full paper at ICLR.
- Evaluating the correctness of code is an important problem, and it is indeed not always possible to execute code.
- The paper's analysis highlights limitations of existing token-based metrics, and focusing on the *changed* parts of code is a sensible observation.
- The paper makes very strong claims about execution not being a suitable measure for revision tasks, yet considers approximation of execution as the only metric in its evaluation. Extensive benchmarks like SWE-Bench can take a long time to evaluate, but a significant portion of this time is the time to obtain a solution, not just the time to evaluate the solution. I would argue that under-approximating program behavior using execution is better than not even evaluating proper syntactic correctn
- A well-motivated problem: The paper clearly defines the task of revision similarity and highlights a fundamental flaw in using standard metrics like BLEU for it. As AI-driven editing becomes more common, a reliable metric for this is highly important.
- An intuitive and well-motivated solution: The core idea of removing the shared context to focus on the edited regions is simple, elegant, and directly addresses the stated problem. The authors also justify their design by showing how ES avoids
- Limited Semantic Understanding: The metric is still fundamentally lexical. Since the final step uses SARI, the evaluation relies on matching n-grams within the edited regions. The paper itself acknowledges this limitation, stating that ES only "partially satisfies" its own criterion for semantic equivalence (Property 5). While this approach is shown to handle simple cases like misplaced insertions, it would likely fail to reward more complex, semantically-equivalent-but-lexically-different co
1. The idea of using LCS to establish a metric for revision tasks is novel.
2. The empirical results confirm the effectiveness of the proposed metric.
1. Since the correlation is based on comparing each metric with actual execution, it seems that actual execution is a perfect metric. Thus, the usefulness of the proposed metric is mainly due to its efficiency or circumstances where execution is infeasible.
2. The impact of exact matches is not considered in the evaluation. Since the metric of Exact Match can already identify generated revisions that are identical to ground truth revisions, it should be more interesting to show the effectiveness
