MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu; Haodong Duan; Yuanhan Zhang; Bo Li; Songyang Zhang; Wangbo Zhao; Yike Yuan; Jiaqi Wang; Conghui He; Ziwei Liu; Kai Chen; Dahua Lin

arXiv:2307.06281 · cs.CV · August 21, 2024 · 33 cites

TL;DR

MMBench is a comprehensive bilingual benchmark designed to evaluate multi-modal vision-language models more accurately and holistically through quality-controlled questions, a CircularEval strategy, and bilingual assessments.

Contribution

It introduces a meticulously curated, objective evaluation pipeline with a novel CircularEval strategy and bilingual questions, surpassing existing benchmarks in scope and accuracy.

Findings

01

MMBench provides more diverse and comprehensive evaluation questions.

02

The CircularEval strategy improves accuracy in model assessment.

03

Bilingual questions enable fair comparison across languages.
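The CircularEval idea named in the findings can be sketched in a few lines. As described in the MMBench paper, a multiple-choice question with N options is posed N times, each time with the options circularly shifted, and the model is credited only if it answers correctly on every pass. The sketch below is illustrative, not the authors' implementation: the `predict` callback, its signature (question text plus option list in, chosen option index out), and the helper names are all assumptions made for this example.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical model interface: takes the question and the (shifted) options,
# returns the index of the option the model picks.
Predictor = Callable[[str, Sequence[str]], int]


def circular_shifts(
    choices: Sequence[str], answer_idx: int
) -> Iterator[Tuple[List[str], int]]:
    """Yield every circular shift of the choices with the answer's new index."""
    n = len(choices)
    for shift in range(n):
        rotated = [choices[(i + shift) % n] for i in range(n)]
        # The correct option moves from answer_idx to (answer_idx - shift) mod n.
        yield rotated, (answer_idx - shift) % n


def circular_eval(
    question: str, choices: Sequence[str], answer_idx: int, predict: Predictor
) -> bool:
    """Return True only if the model is correct under all N circular shifts."""
    for rotated, new_idx in circular_shifts(choices, answer_idx):
        if predict(question, rotated) != new_idx:
            return False  # a single wrong pass fails the whole question
    return True


# Two toy predictors (assumptions for illustration only).
def always_first(question: str, options: Sequence[str]) -> int:
    """A position-biased model that always picks the first option."""
    return 0


def content_aware(question: str, options: Sequence[str]) -> int:
    """A model that finds the right content wherever it appears."""
    return list(options).index("cat")


if __name__ == "__main__":
    opts = ["dog", "cat", "bird"]
    print(circular_eval("What animal is shown?", opts, 1, content_aware))
    print(circular_eval("What animal is shown?", opts, 1, always_first))
```

The all-passes rule is what filters out position bias: a model that always picks the same letter can be right on at most one of the N shifts, so it never receives credit for a question with more than one option.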

Abstract

Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering future development in this domain. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but lack fine-grained ability assessment and robust evaluation metrics. Meanwhile, subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, which is not scalable and may display significant bias. In response to these challenges, we propose MMBench, a bilingual benchmark for assessing the multi-modal capabilities of VLMs. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of the following key features: 1. MMBench is…

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

Current MLLMs badly need a fair and reasonable benchmark to assess the strengths and weaknesses of different methods, making the problem addressed in this paper highly significant. The proposed CircularEval strategy effectively enhances the robustness of the evaluations.

Weaknesses

The authors should provide results for GPT4 to establish the upper bound of performance within the proposed benchmark. For tasks that perform poorly within the current benchmark, the authors should explain why the models exhibit such poor performance. Is it due to inherent issues with the tasks themselves? Additionally, a comparison with the results of GPT4 can be made to analyze the performance shortcomings of the current open-source MLLMs.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The paper comes with a relatively big (3k) and well-designed benchmark for VLM evaluation, which is an important contribution.
2. Evaluation strategies are designed to test the VLMs that cannot generate single-choice answers. ChatGPT is used in this case, with an analysis compared to human evaluation to show that the introduction of ChatGPT does lead to evaluation bias.
3. The paper is well-written and easy to follow.

Weaknesses

1. More discussions of the 20 different ability dimensions would be favored. How these dimensions are selected can be discussed further. Moreover, in many cases, multiple abilities are entangled with each other in order to correctly answer a question. For example, “How many apples are there in the image?” as shown in Fig-3 requires both numerical (counting) reasoning and perception (detect apples), which category does this example belong to?
2. The results are usually “winner takes all”. As sho

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

There are several strengths about this work:
- The vision-language community certainly needs more objective benchmarks for evaluating recent multimodal models.
- The proposed VQA benchmark covers a wide array of abilities (over 20).
- The paper comprehensively tests most recent multimodal models (18 of them).

Weaknesses

I have several major concerns about dataset collection and evaluation strategies.

Dataset Collection and Quality

As the major contribution of this paper is the new VQA benchmark, I find the paper did a poor job in explaining how the samples are generated, collected, and verified. For example, how did you select images from existing sources? How did the annotator come up with QA pairs based on the images? How did you verify the correctness/relevance of these samples? From the current

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning