Benchmarking and Rethinking Knowledge Editing for Large Language Models
Guoxiu He, Xin Song, Futing Wang, Aixin Sun

TL;DR
This paper conducts a comprehensive benchmarking of knowledge editing methods for large language models, revealing the limitations of parameter-based approaches and highlighting the robustness of context-based reasoning, especially in multi-edit scenarios.
Contribution
It introduces new complex datasets, evaluates multiple editing methods under realistic settings, and demonstrates the superiority of a simple context-based baseline over existing techniques.
Findings
Parameter-based methods perform poorly in realistic conditions.
The SCR baseline consistently outperforms recent methods.
Context-based reasoning shows greater robustness in multi-edit scenarios.
Abstract
Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs, under a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also evaluate multi-edit scenarios to better reflect practical demands. We employ four evaluation dimensions, including portability, and compare all recent methods against a simple and straightforward baseline named Selective…
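The distinction the abstract draws between teacher-forced decoding and autoregressive inference can be illustrated with a toy sketch. Everything below is hypothetical and not taken from the paper: the lookup-table "model", the prompt, the target, and the function names are stand-ins chosen only to show how the two scoring modes can disagree. In teacher-forced scoring, each target token is predicted from the gold prefix; in autoregressive scoring, the model must continue from its own earlier outputs.

```python
from typing import Callable, List, Sequence

Token = str
NextToken = Callable[[Sequence[Token]], Token]


def teacher_forced_accuracy(next_token: NextToken,
                            prompt: Sequence[Token],
                            target: Sequence[Token]) -> float:
    """Fraction of target tokens predicted correctly when the gold prefix is fed in."""
    hits = 0
    for i, gold in enumerate(target):
        prefix = list(prompt) + list(target[:i])  # gold tokens, not model outputs
        if next_token(prefix) == gold:
            hits += 1
    return hits / len(target)


def autoregressive_generate(next_token: NextToken,
                            prompt: Sequence[Token],
                            max_new_tokens: int) -> List[Token]:
    """Generate tokens one at a time, feeding each prediction back as context."""
    context = list(prompt)
    generated: List[Token] = []
    for _ in range(max_new_tokens):
        tok = next_token(context)
        generated.append(tok)
        context.append(tok)
    return generated


def toy_next_token(context: Sequence[Token]) -> Token:
    """Hypothetical stand-in for an edited model: looks only at the last token."""
    table = {"is": "Paris", "Lyon": "in", "Paris": "in", "in": "France"}
    return table.get(context[-1], "<unk>")


prompt = ["The", "capital", "is"]
target = ["Lyon", "in", "France"]  # pretend this is the edited answer

tf = teacher_forced_accuracy(toy_next_token, prompt, target)
gen = autoregressive_generate(toy_next_token, prompt, len(target))
ar_exact = gen == target

print(f"teacher-forced token accuracy: {tf:.2f}")
print(f"autoregressive output: {gen}, exact match: {ar_exact}")
```

In this toy case the teacher-forced score gives partial credit for the later tokens even though the model never produces the edited answer on its own, which is one way the two settings can yield different pictures of editing success.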
Peer Reviews
Decision·Submitted to ICLR 2026
1. This paper is well written and easy to follow. The idea and findings are presented very clearly. 2. This paper conducts comprehensive experiments covering different types of base models, editing methods, and benchmarks. 3. The findings are promising and reasonable.
1. The technical and even the practical contribution is limited. This paper does not provide a theoretical analysis of why knowledge editing models hurt the model's capability, nor does it explain why sequential editing fails. Second, this paper makes no technical contribution on how to mitigate the side effects of knowledge editing. Third, although the paper claims that they "aim to conduct a comprehensive benchmark study", they only use existing datasets for evaluation, instead of making any contr
1. This paper is clearly written, well organized, and generally easy to understand. 2. This paper addresses the important problem of knowledge editing in LLMs. 3. The authors' effort in conducting a large-scale, comprehensive benchmark is commendable.
The primary weakness of this paper is its limited novelty. While the benchmarking effort is extensive, the core research questions and many of the conclusions align closely with existing knowledge and prior work, making the contribution more of a confirmation than a new discovery. 1. RQ1 investigates editing under autoregressive inference and sequential editing scenarios. The challenges of sequential editing are well-investigated problems in the field. Similarly, previous works such as [1], [2]
+ I agree that autoregressive decoding, sequential edits, and portability are all important and often ignored, and it's good to see that the paper productively consolidates them in one framework. + The authors conducted extensive experiments that evaluate 12 recent methods across multiple LLMs (instruct + reasoning) and dataset types (facts, event-level; plus general reasoning benchmarks). + It's also interesting to see that editing can harm reasoning capabilities.
- Although the paper provides some empirical insights, the core advance is an evaluation protocol + dataset selection and a simple SCR baseline. There is no new algorithmic insight for knowledge editing itself, which leads to limited novelty. - The scale of edits seems to be limited. The experiments are conducted on 1 or 100 edits, but mass-editing methods such as MEMIT claim that they can scale up to thousands of edits. This raises concerns on whether the conclusions hold for larg
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies
