MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge
Yuntao Du, Kailin Jiang, Zhi Gao, Chenrui Shi, Zilong Zheng, Siyuan Qi, Qing Li

TL;DR
MMKE-Bench is a comprehensive benchmark for evaluating how well large multimodal models (LMMs) can be edited with diverse visual knowledge expressed in free-form natural language, addressing the limitations of existing entity-focused, triplet-based benchmarks.
Contribution
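Introduces MMKE-Bench, a multimodal knowledge editing benchmark that covers three editing tasks (visual entity, visual semantic, and user-specific editing) and represents knowledge in free-form natural language, to better evaluate real-world visual knowledge editing in LMMs.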
Findings
Current methods struggle with visual and user-specific edits.
No single method outperforms others across all tasks.
The benchmark reveals gaps in existing knowledge editing techniques.
Abstract
Knowledge editing techniques have emerged as essential tools for updating the factual knowledge of large language models (LLMs) and multimodal models (LMMs), allowing them to correct outdated or inaccurate information without retraining from scratch. However, existing benchmarks for multimodal knowledge editing primarily focus on entity-level knowledge represented as simple triplets, which fail to capture the complexity of real-world multimodal information. To address this issue, we introduce MMKE-Bench, a comprehensive MultiModal Knowledge Editing Benchmark, designed to evaluate the ability of LMMs to edit diverse visual knowledge in real-world scenarios. MMKE-Bench addresses these limitations by incorporating three types of editing tasks: visual entity editing, visual semantic editing, and user-specific editing. Besides, MMKE-Bench uses free-form natural language to represent and edit…
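The contrast the abstract draws between triplet-style entity knowledge and free-form natural language knowledge can be illustrated with a small sketch. This is not the benchmark's actual data schema: the class names, field names, image paths, and example descriptions below are all hypothetical and chosen only to show how the three editing task types might be tagged on a free-form record.

```python
from dataclasses import dataclass
from enum import Enum


class EditType(Enum):
    """The three editing task types named in the abstract."""
    VISUAL_ENTITY = "visual entity editing"
    VISUAL_SEMANTIC = "visual semantic editing"
    USER_SPECIFIC = "user-specific editing"


@dataclass(frozen=True)
class TripletEdit:
    """Entity-level knowledge as a (subject, relation, object) triplet,
    the style used by earlier benchmarks according to the abstract."""
    subject: str
    relation: str
    new_object: str


@dataclass(frozen=True)
class FreeFormEdit:
    """Hypothetical record: an image paired with a free-form description."""
    edit_type: EditType
    image_path: str   # assumed field, not from the paper
    description: str  # free-form natural language knowledge


def count_by_type(edits):
    """Count how many edit records fall under each task type."""
    counts = {edit_type: 0 for edit_type in EditType}
    for edit in edits:
        counts[edit.edit_type] += 1
    return counts


# Invented placeholder records, for illustration only.
examples = [
    FreeFormEdit(EditType.VISUAL_ENTITY, "images/entity_001.jpg",
                 "A made-up description of the pictured entity."),
    FreeFormEdit(EditType.VISUAL_SEMANTIC, "images/gesture_001.jpg",
                 "A made-up description of what the pictured gesture means."),
    FreeFormEdit(EditType.USER_SPECIFIC, "images/user_item_001.jpg",
                 "A made-up statement about a user's personal item."),
]
counts = count_by_type(examples)
```

A free-form record can hold a multi-sentence description that a single (subject, relation, object) triplet cannot, which is the limitation of triplet-based benchmarks that the abstract points to.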
Peer Reviews
Decision: ICLR 2025 Poster
1. The benchmark proposed in this paper uses knowledge represented in free-form natural language, making it more applicable to real-world scenarios. In addition to traditional visual entity editing, the benchmark incorporates visual semantic editing and user-specific editing, allowing for a more comprehensive evaluation of model editing capabilities. 2. The paper provides a detailed description of the dataset construction process, offering valuable insights and methodologies for data collection
1. I’m not sure if the workload is sufficient. If this were solely a Dataset & Benchmark Track submission, the workload would be appropriate. However, as a long paper submission to ICLR, it may require additional technical contributions. For instance, adding theoretical analysis to explain why existing methods perform poorly in multimodal knowledge editing could provide new perspectives for improving this area. Therefore, it’s recommended to include an in-depth analysis on why current methods un
(1) Diverse Task Setup: MMKE-Bench covers three distinct knowledge editing tasks, from entity-level editing to more complex user-specific knowledge editing. This provides a comprehensive tool for evaluating multimodal models' ability to update knowledge and handle personalized information. (2) Free-Form Natural Language Descriptions: Unlike traditional triple-based representations, this benchmark uses natural language descriptions to represent knowledge items, enabling models to engage in editi
(1) Figures 1 and 2 Could Benefit from Improved Clarity and Accessibility: Figures 1 and 2 could be refined to enhance clarity and accessibility for a broader audience. In Figure 1, the example used is soccer-related, which may not be immediately understandable to readers unfamiliar with the sport. A more universally recognizable example, such as common objects or activities, could make the data construction process clearer. For Figure 2, the visual design could better distinguish the four datas
The use of free-form natural language as input for knowledge editing tasks is a notable strength, enhancing flexibility and making the approach adaptable. The clarity of the writing aids comprehension, and the experimental setup is well-documented. Additionally, the benchmark spans diverse data sources and entity types, allowing for broad applicability across different tasks.
There is some overlap between visual entity editing and visual semantic editing, as both tasks involve understanding image content, which could blur the distinction between these editing types. Additionally, the user-specific editing scenario may lack practicality. In real-world applications, database or memory-based search might be more effective than training user-specific information for each user to achieve personalization in LLMs or LMMs. Regarding the T-Loc test, there’s room for improveme
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling
Methods: Focus
