KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution

Junzhe Zhang; Huixuan Zhang; Xiaojun Wan

arXiv:2510.21182·cs.CV·January 22, 2026

TL;DR

KBE-DME introduces a dynamic, knowledge-enhanced evaluation framework for multimodal models that mitigates static benchmark limitations and offers more reliable, difficulty-controllable assessments of model capabilities.

Contribution

This work presents KBE-DME, a novel framework that transforms static benchmarks into dynamic, knowledge-integrated evaluations for multimodal large language models.

Findings

01. KBE reduces data contamination risks.

02. KBE provides more comprehensive model assessments.

03. KBE enables difficulty-controllable evaluation.

Abstract

The rapid progress of multimodal large language models (MLLMs) calls for more reliable evaluation protocols. Existing static benchmarks are at risk of data contamination and saturation, leading to inflated or misleading performance evaluations. To address these issues, we first apply a graph formulation to represent a static or dynamic VQA sample. With this formulation, we propose Knowledge-enhanced Benchmark Evolution (KBE), a dynamic multimodal evaluation framework. KBE first analyzes the original static benchmark, then expands it by integrating multimodal knowledge, transforming the static benchmark into a controllable, dynamically evolving version. Crucially, KBE can both reconstruct questions by re-selecting visual information in the original image and expand existing questions with external textual knowledge. It enables difficulty-controllable evaluation by adjusting…
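To make the graph formulation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a VQA sample can be stored as two lists of (subject, relation, object) triplets, one for visual facts and one for external textual facts, following the (s, r, o) notation the reviewers discuss. The names `VQASample`, `expand`, and `hops` are hypothetical, the example facts are invented for illustration, and the triplet count is used only as an illustrative proxy for difficulty, since the abstract is cut off before it states what is adjusted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A knowledge triplet: (subject, relation, object). Hypothetical encoding.
Triplet = Tuple[str, str, str]


@dataclass
class VQASample:
    """One VQA item viewed as a small graph of triplets (illustrative only)."""
    question: str
    answer: str
    visual: List[Triplet] = field(default_factory=list)   # facts read from the image
    textual: List[Triplet] = field(default_factory=list)  # facts from external knowledge

    def hops(self) -> int:
        # Assumption: number of triplets as a rough stand-in for difficulty.
        return len(self.visual) + len(self.textual)


def expand(sample: VQASample, fact: Triplet, new_question: str) -> VQASample:
    """Extend a sample with one external fact that starts at the current answer."""
    subject, _relation, obj = fact
    if subject != sample.answer:
        raise ValueError("fact must start from the current answer entity")
    return VQASample(
        question=new_question,
        answer=obj,                      # the new answer is the object of the added fact
        visual=list(sample.visual),      # visual facts are carried over unchanged
        textual=sample.textual + [fact],
    )


base = VQASample(
    question="What landmark is shown in the image?",
    answer="Eiffel Tower",
    visual=[("image", "shows", "Eiffel Tower")],
)
evolved = expand(
    base,
    ("Eiffel Tower", "located_in", "Paris"),
    "In which city is the landmark shown in the image located?",
)
```

Here `evolved` asks a question that needs both the visual fact and the added textual fact, while `base` is left untouched. Re-selecting visual information would, under the same assumptions, amount to choosing a different subset of visual triplets before generating the question.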

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

This paper is well-motivated. Such a benchmark can help evaluate MLLMs more reliably. The method is developed with the ability to control the difficulty of generated queries, showing its universality for various MLLMs. Modeling the VQA queries in the form of triplets can help inspire future work.

Weaknesses

1. The notation used in the paper could be more clearly defined. Could the authors please clarify the meaning of these symbols and specify what each subscript denotes?
   - For visual knowledge triplets $M=\{e_{m}\}$, what exactly is $e_m$? If $e_m$ can be expressed as $(s,r,o)$, what do $s$, $r$, and $o$ refer to exactly? Can $s$ stand for an image, or are they all textual content? The same question applies to $T=\{e_{t}\}$.
   - Line 217. Does $m$ refer to the triplet $(s,r,o)$? Why do we define $m$? Is $m$

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1) Structured Representation: The graph-based formalization of VQA samples into triplets (s, r, o) is clean and conceptually useful. It provides an interpretable framework that could, in principle, be extended to other modalities.
2) Comprehensive Experimentation: Testing across multiple strong MLLMs (GPT-4o, Gemini, Claude, Qwen, LLaVA) demonstrates awareness of model diversity. The observed monotonic accuracy drop supports the intuition that the generated samples are more complex.
3) High Qu

Weaknesses

1) Overreliance on Proprietary LLMs: The framework’s key processes (triplet extraction, filtering, generation, and even evaluation) depend entirely on GPT-4o. This makes the method difficult to reproduce and raises uncertainty about whether the improvements stem from the proposed framework itself or from GPT’s latent knowledge and linguistic priors.
2) Limited Novelty in Practice: The idea of dynamically generating evaluation samples has already been explored in prior works (e.g., DyVal, VLB). Th

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Originality: The paper introduces an original and conceptually elegant reformulation of multimodal evaluation as a dynamic, knowledge-evolving process rather than relying on static datasets.
2. Technical Innovation: It proposes a novel graph-based representation of VQA tasks combined with triplet re-selection and external knowledge exploration, offering a principled framework (KBE-DME) to mitigate benchmark contamination and saturation.
3. Experimental Rigor: The work includes comprehensive e

Weaknesses

1. Insufficient Baseline Comparison: Comparisons are mainly conducted against perturbation-based dynamic evaluation methods; the study would be stronger if it incorporated generation-based dynamic evaluation baselines such as DyVal, NPHardEval, and MPA (Meta Probing Agents), as well as multimodal counterparts like VLB.
2. Scalability and Cost: The study does not provide a thorough analysis of the framework’s computational cost, scalability, or feasibility when large LLMs are not accessible.
3. R
