Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
Yue Yang, Shuibai Zhang, Wenqi Shao, Kaipeng Zhang, Yi Bin, Yu Wang, Ping Luo

TL;DR
This paper introduces Vision-Language Bootstrapping (VLB), a dynamic evaluation protocol for LVLMs that reduces data contamination and adapts to models' evolving capabilities by generating new multimodal samples.
Contribution
The paper presents a novel dynamic evaluation method that generates diverse, consistent multimodal samples to assess LVLMs more robustly and flexibly than static benchmarks.
Findings
VLB reduces data contamination in evaluations.
VLB exposes limitations of current LVLMs.
VLB adapts to models' evolving capabilities.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to good performance on various multimodal evaluation benchmarks. However, these benchmarks keep a static nature and overlap with the pre-training data, resulting in fixed complexity constraints and data contamination issues. This raises the concern regarding the validity of the evaluation. To address these two challenges, we introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB). VLB provides a robust and comprehensive assessment for LVLMs with reduced data contamination and flexible complexity. To this end, VLB dynamically generates new visual question-answering samples through a multimodal bootstrapping module that modifies both images and language, while ensuring that newly generated samples…
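The generate-then-verify loop described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `VQASample`, `Transform`, `Judge`, `bootstrap`, and the toy `add_object` and `rephrase` functions are hypothetical names, and the retry budget `max_tries` is an assumption. In practice the transforms would wrap an image editor and a language model, and the judge would be a model that checks the new sample still supports the original answer.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class VQASample:
    image: str      # stand-in for image data (e.g. a file path)
    question: str
    answer: str


# Hypothetical interfaces: a transform edits a sample; a judge compares two samples.
Transform = Callable[[VQASample], VQASample]
Judge = Callable[[VQASample, VQASample], bool]


def bootstrap(
    sample: VQASample,
    visual: Sequence[Transform],
    linguistic: Sequence[Transform],
    judge: Judge,
    max_tries: int = 3,
) -> Optional[VQASample]:
    """Apply the transforms in order; keep the result only if the judge accepts it."""
    for _ in range(max_tries):
        candidate = sample
        for transform in (*visual, *linguistic):
            candidate = transform(candidate)
        # The reference answer must be unchanged and the judge must agree.
        if candidate.answer == sample.answer and judge(sample, candidate):
            return candidate
    return None  # no consistent variant found within the retry budget


# Toy stand-ins for an image editor and a question rephraser (assumptions).
def add_object(s: VQASample) -> VQASample:
    return replace(s, image=s.image + "+object")


def rephrase(s: VQASample) -> VQASample:
    return replace(s, question="Rephrased: " + s.question)


original = VQASample("car.png", "What color is the car?", "red")
variant = bootstrap(original, [add_object], [rephrase], judge=lambda o, c: True)
print(variant)
```

Composing several visual and linguistic transforms in one call is one way to vary difficulty; a rejected candidate is simply regenerated, which only makes sense when the underlying transforms are stochastic.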
Peer Reviews
Decision: ICLR 2025 Oral
1. The paper identifies an important research direction: existing benchmarks are static, and because pretraining data is so large, it is hard to verify whether some test data has leaked into the pretraining or training data. This makes evaluation difficult, and the paper seeks to develop a new paradigm for evaluation. 2. The idea of using insights from user interactions to inform the transformations V and L is interesting. 3. The experiments are comprehensive.
1. The role of user interaction is not defined in detail. See Q1. 2. Question rephrasing has been explored in several earlier works on VQA (e.g., the VQA Rephrasings dataset) and in robustness work such as VQA-LOL, VQA-Subquestions, and others. What is the overlap of the proposed work with those benchmarks? 3. The work focuses only on VQA, but VLMs can perform several other tasks. Can the framework also handle capabilities that have to be evaluated without VQA?
1. Thorough experimental results validate the dynamic evaluation protocol VLB. First, a judge model is introduced to ensure that the dynamic image-question pair is still consistent with the original answer. Second, human examination of 2,100 samples verifies that fewer than 5% of samples introduce inconsistency. 2. The composition of multiple strategies effectively reduces data contamination and enables a wide range of difficulty levels. VLB can serve as a more reliable evaluation protocol than
1. The major concern lies in the performance variance. Even with the same image-question sample and the same bootstrapping strategy, different dynamic samples can be generated, due to the randomness in GPT-4V and PowerPaint. However, the experiments do not show the scale of this variance caused by randomness. If this variance is large, the performance metrics may be less reliable. 2. Although the human verification (Figure 11) shows high consistency for each bootstrapping strategy, it is unclea
The paper is a pleasant read and is easy to follow. The paper is written in a well structured way following a clear plot line. I particularly find the parts where the bootstrapping strategies are introduced well written, which greatly helps me understand this work on an intuitive level.
I do have one particular concern regarding the veracity of the VLB-modified data. VLB strategies such as V1 (editing a new object into the image) and L4 (adding irrelevant context to the text) modify the original test case in a controlled manner. **However, how do we verify that the original test cases have been faithfully modified in the way we want?** So far, such veracity verification steps are only observed in Figure 11 using human verification on a small batch of sampled data. The authors s
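One of the reviews above asks how much scores vary when the same sample and strategy are regenerated with different random outcomes. A simple way to quantify that, sketched here under assumptions, is to score a model on several independently generated versions of the benchmark and report the mean and sample standard deviation of accuracy. The function names `accuracy` and `spread` are hypothetical, exact string match is an assumed scoring rule, and the numbers in the usage lines are made-up placeholders, not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def spread(run_accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated generation runs."""
    if len(run_accuracies) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return statistics.mean(run_accuracies), statistics.stdev(run_accuracies)


# Placeholder accuracies from three hypothetical regenerations of one benchmark.
mean_acc, std_acc = spread([0.62, 0.60, 0.64])
print(f"accuracy = {mean_acc:.3f} +/- {std_acc:.3f}")
```

A small standard deviation relative to the gaps between models would suggest the rankings are stable under regeneration; a large one would support the reviewer's concern.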
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
