EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts
Garvin Kruthof

TL;DR
This paper introduces EditPropBench, a benchmark to evaluate how well large language models propagate factual edits in scientific manuscripts, revealing current limitations in cascade-aware revision capabilities.
Contribution
The paper presents a benchmark (EditPropBench) and a metric (Edit-Ripple Adherence, ERA) for measuring factual edit propagation in scientific texts, and uses them to expose a gap in current LLM editing systems.
Findings
LLMs achieve ERA scores between 0.148 and 0.705 on complex cases.
Current LLM editors miss roughly 30% of the downstream updates that an edit requires.
Explicit and easy cases are better handled than implicit or free-form ones.
Abstract
Local factual edits in scientific manuscripts often create non-local revision obligations. If a dataset changes from 215 to 80 documents, claims such as 'medium-scale' or 'a few hundred items' may also become stale, even though they do not repeat the edited number. In an audit of recent arXiv cs.CL benchmark and dataset papers, we find fact-dependent qualitative claims in 37.2% of papers, suggesting that this dependency pattern is common in the target genre. We introduce EditPropBench, a benchmark for measuring whether LLM editors propagate factual edits through dependent manuscript claims. Each item contains an ML/NLP-style synthetic manuscript, a targeted edit, and a controlled fact graph with sentence-level labels for direct targets, required downstream updates, and unrelated text that should remain unchanged. We summarize cascade success with Edit-Ripple Adherence (ERA), the…
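The item structure described in the abstract (a manuscript, a targeted edit, and sentence-level labels for direct targets, required downstream updates, and unrelated text) can be sketched as below. This is a minimal illustration, not the benchmark's actual schema or the paper's ERA definition, which is not given in the text above. The names `Label`, `Item`, and `propagation_summary` are hypothetical, and the sketch treats a sentence as "updated" whenever its text changed, which does not check that the new wording is correct.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    DIRECT = "direct"          # sentence states the edited fact itself
    DOWNSTREAM = "downstream"  # sentence depends on the fact and must be revised
    UNRELATED = "unrelated"    # sentence should remain unchanged


@dataclass
class Item:
    """Hypothetical item layout: one label per manuscript sentence."""
    sentences: list[str]
    labels: list[Label]


def propagation_summary(item: Item, revised: list[str]) -> dict[str, float]:
    """Per-label rates of desired behaviour (a stand-in, NOT the paper's ERA).

    DIRECT and DOWNSTREAM sentences count as handled if their text changed;
    UNRELATED sentences count as handled if their text stayed the same.
    """
    if len(revised) != len(item.sentences):
        raise ValueError("revised text must align sentence-by-sentence")
    changed = [old != new for old, new in zip(item.sentences, revised)]

    def rate(label: Label, want_changed: bool) -> float:
        idx = [i for i, lab in enumerate(item.labels) if lab is label]
        if not idx:
            return float("nan")  # no sentences carry this label
        return sum(changed[i] == want_changed for i in idx) / len(idx)

    return {
        "direct_updated": rate(Label.DIRECT, True),
        "downstream_updated": rate(Label.DOWNSTREAM, True),
        "unrelated_preserved": rate(Label.UNRELATED, False),
    }


# Toy example built from the abstract's 215 -> 80 scenario.
example = Item(
    sentences=[
        "The dataset contains 215 documents.",
        "This makes it a medium-scale resource.",
        "We release all annotation guidelines.",
    ],
    labels=[Label.DIRECT, Label.DOWNSTREAM, Label.UNRELATED],
)
# An editor that fixes the number but leaves the stale qualitative claim.
revised = [
    "The dataset contains 80 documents.",
    "This makes it a medium-scale resource.",
    "We release all annotation guidelines.",
]
summary = propagation_summary(example, revised)
```

In this toy case the editor applies the direct edit and preserves unrelated text but misses the dependent 'medium-scale' claim, which is the failure mode the benchmark targets.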
