Benchmarking and Rethinking Knowledge Editing for Large Language Models

Guoxiu He; Xin Song; Futing Wang; Aixin Sun

arXiv:2505.18690·cs.CL·May 27, 2025

TL;DR

This paper comprehensively benchmarks knowledge editing methods for large language models, revealing the limitations of parameter-based approaches and highlighting the robustness of context-based reasoning, especially in multi-edit scenarios.

Contribution

It introduces new complex datasets, evaluates multiple editing methods under realistic settings, and demonstrates the superiority of a simple context-based baseline over existing techniques.

Findings

1. Parameter-based methods perform poorly in realistic conditions.
2. The SCR baseline consistently outperforms recent methods.
3. Context-based reasoning shows greater robustness in multi-edit scenarios.
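
To make the contrast with parameter editing concrete, here is a minimal sketch of what a context-based approach could look like. It is not the paper's SCR baseline, whose details are not given in this summary. The retrieval rule (lowercase token overlap), the threshold `min_overlap`, the prompt template, and the example facts are all assumptions invented for illustration.

```python
import re
from typing import List, Optional, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_edit(query: str, memory: List[str], min_overlap: int = 2) -> Optional[str]:
    """Return the stored fact sharing the most tokens with the query,
    or None if no fact reaches the (hypothetical) overlap threshold."""
    query_tokens = tokenize(query)
    best, best_score = None, 0
    for fact in memory:
        score = len(query_tokens & tokenize(fact))
        if score > best_score:
            best, best_score = fact, score
    return best if best_score >= min_overlap else None


def build_prompt(query: str, memory: List[str]) -> str:
    """Prepend the retrieved fact only when one is judged relevant."""
    fact = retrieve_edit(query, memory)
    if fact is None:
        return f"Question: {query}\nAnswer:"
    return f"Updated fact: {fact}\nQuestion: {query}\nAnswer:"


# Made-up edits; adding more edits only appends to this list,
# and no model weights are touched.
MEMORY = [
    "The CEO of Acme Corp is Dana Lee.",
    "The Rhine bridge reopened in 2024.",
]

print(build_prompt("Who is the CEO of Acme Corp?", MEMORY))
print(build_prompt("What is 2 plus 2?", MEMORY))
```

In this sketch an unrelated question gets the plain prompt, so the model's behaviour on it is unchanged by the stored edits. A real system would need a stronger retriever than token overlap, which is easily misled by common words.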

Abstract

Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs, under a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also evaluate multi-edit scenarios to better reflect practical demands. We employ four evaluation dimensions, including portability, and compare all recent methods against a simple and straightforward baseline named Selective…
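
The distinction the abstract draws between teacher-forced decoding and autoregressive inference can be illustrated with a toy sketch. The "model" below is a hypothetical bigram lookup table standing in for an edited LLM, and the prompt, target, and token names are invented for illustration; none of this comes from the paper's code.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]

# Hypothetical next-token table: the edit "failed" on the first token
# (the model still says "Paris") but later tokens follow the new target.
BIGRAMS = {"is": "Paris", "Lyon": ",", ",": "France", "Paris": ".", ".": "<eos>"}


def toy_next_token(prefix: List[str]) -> str:
    """Predict the next token from the last token of the prefix only."""
    return BIGRAMS.get(prefix[-1], "<eos>")


def teacher_forced_accuracy(model: NextToken, prompt: List[str], target: List[str]) -> float:
    """Score each position while feeding the gold prefix, so an early
    mistake never affects later predictions."""
    correct = 0
    for i, gold in enumerate(target):
        if model(prompt + target[:i]) == gold:
            correct += 1
    return correct / len(target)


def autoregressive_generate(model: NextToken, prompt: List[str], max_new_tokens: int) -> List[str]:
    """Greedy generation: each prediction is fed back as input."""
    generated: List[str] = []
    for _ in range(max_new_tokens):
        token = model(prompt + generated)
        if token == "<eos>":
            break
        generated.append(token)
    return generated


PROMPT = ["The", "capital", "of", "France", "is"]
TARGET = ["Lyon", ",", "France"]  # counterfactual edit target

TF_ACC = teacher_forced_accuracy(toy_next_token, PROMPT, TARGET)
GENERATED = autoregressive_generate(toy_next_token, PROMPT, len(TARGET))
EXACT = GENERATED == TARGET

print(f"teacher-forced accuracy: {TF_ACC:.2f}")
print(f"autoregressive output:   {GENERATED}, exact match: {EXACT}")
```

Under teacher forcing the toy model gets two of three target tokens right, because the gold prefix hides its first-token error. Generating freely, it never produces the edited answer at all, which is why the two settings can rank editing methods very differently.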

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. This paper is well written and easy to follow. The ideas and findings are presented very clearly.
2. This paper conducts comprehensive experiments covering different types of base models, editing methods, and benchmarks.
3. The findings are promising and reasonable.

Weaknesses

1. The technical and even the practical contribution is limited. First, this paper does not provide a theoretical analysis of why knowledge editing hurts the model's capability or explain why sequential editing fails. Second, this paper does not make any technical contribution on how to mitigate the side effects of knowledge editing. Third, although the paper claims that they "aim to conduct a comprehensive benchmark study", they only use existing datasets for evaluation, instead of making any contr

Reviewer 02 · Rating 2 · Confidence 3

Strengths

1. This paper is clearly written, well organized, and generally easy to understand.
2. This paper addresses the important problem of knowledge editing in LLMs.
3. The authors' effort in conducting a large-scale, comprehensive benchmark is commendable.

Weaknesses

The primary weakness of this paper is its limited novelty. While the benchmarking effort is extensive, the core research questions and many of the conclusions align closely with existing knowledge and prior work, making the contribution more of a confirmation than a new discovery.
1. RQ1 investigates editing under autoregressive inference and sequential editing scenarios. The challenges of sequential editing are well-investigated problems in the field. Similarly, previous works such as [1], [2]

Reviewer 03 · Rating 4 · Confidence 3

Strengths

+ I agree that autoregressive decoding, sequential edits, and portability are all important and often ignored, and it's good to see that the paper productively consolidates them in one framework.
+ The authors conducted extensive experiments that evaluate 12 recent methods across multiple LLMs (instruct + reasoning) and dataset types (fact-level and event-level, plus general reasoning benchmarks).
+ It's also interesting to see that editing can harm reasoning capabilities.

Weaknesses

- Although the paper provides some empirical insights, the core advance is an evaluation protocol + dataset selection and a simple SCR baseline. There is no new algorithmic insight for knowledge editing itself, which leads to limited novelty.
- The scale of edits seems to be limited. The experiments are conducted on 1 or 100 edits, but mass-editing methods such as MEMIT claim that they can scale up to thousands of edits. This raises concerns about whether the conclusions hold for larg

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies