MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin

TL;DR
MMBench is a comprehensive bilingual benchmark designed to evaluate multi-modal vision-language models more accurately and holistically through quality-controlled questions, a CircularEval strategy, and bilingual assessments.
Contribution
The paper introduces a curated, objective evaluation pipeline for VLMs, built around quality-controlled questions, a novel CircularEval strategy, and bilingual questions.
Findings
MMBench provides more diverse and comprehensive evaluation questions.
The CircularEval strategy makes model assessment more robust.
Bilingual questions enable fair comparison across languages.
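This summary does not spell out how CircularEval works. As described in the MMBench paper, a question with N choices is presented N times, with the choices circularly shifted on each pass, and the model is credited only if it answers every pass correctly. The snippet below is a minimal sketch of that idea, not the authors' implementation: `circular_eval`, `predict`, and `LABELS` are hypothetical names, and `predict` is assumed to return a single option letter.

```python
from typing import Callable, Sequence

LABELS = "ABCDEFGH"  # option letters; assumes at most 8 choices per question


def circular_eval(
    question: str,
    choices: Sequence[str],
    answer_idx: int,
    predict: Callable[[str, Sequence[str]], str],
) -> bool:
    """Return True only if `predict` is right under every circular shift.

    `predict` is a hypothetical stand-in for a VLM call: it receives the
    question and the (rotated) choices and returns an option letter.
    """
    n = len(choices)
    for shift in range(n):
        # Rotate the choices left by `shift` positions.
        rotated = [choices[(i + shift) % n] for i in range(n)]
        # The correct choice moves to this position after the rotation.
        correct_pos = (answer_idx - shift) % n
        if predict(question, rotated) != LABELS[correct_pos]:
            return False  # a single failed pass fails the whole question
    return True


if __name__ == "__main__":
    options = ["cat", "dog", "bird"]
    # Toy predictors: one tracks the answer text, one always says "A".
    oracle = lambda q, opts: LABELS[opts.index("dog")]
    always_a = lambda q, opts: "A"
    print(circular_eval("Which animal is shown?", options, 1, oracle))
    print(circular_eval("Which animal is shown?", options, 1, always_a))
```

A predictor that always picks the same letter can be right on one ordering by chance, but it fails once the correct choice rotates to another position, which is why the all-passes rule is stricter than a single-pass evaluation.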
Abstract
Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering future development in this domain. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but lack fine-grained ability assessment and robust evaluation metrics. Meanwhile, subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, which is not scalable and may display significant bias. In response to these challenges, we propose MMBench, a bilingual benchmark for assessing the multi-modal capabilities of VLMs. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of the following key features: 1. MMBench is…
Peer Reviews
Decision: Submitted to ICLR 2024
Current MLLMs greatly need a fair and reasonable benchmark to assess the strengths and weaknesses of different methods, making the problem addressed in this paper highly significant. The proposed CircularEval strategy effectively enhances the robustness of the evaluations.
The authors should provide results for GPT-4 to establish the upper bound of performance within the proposed benchmark. For tasks on which models perform poorly within the current benchmark, the authors should explain why the models exhibit such poor performance. Is it due to inherent issues with the tasks themselves? Additionally, a comparison with the results of GPT-4 can be made to analyze the performance shortcomings of the current open-source MLLMs.
1. The paper comes with a relatively big (3k) and well-designed benchmark for VLM evaluation, which is an important contribution. 2. Evaluation strategies are designed to test the VLMs that cannot generate single-choice answers. ChatGPT is used in this case, with an analysis compared to human evaluation to show that the introduction of ChatGPT does lead to evaluation bias. 3. The paper is well-written and easy to follow.
1. More discussions of the 20 different ability dimensions would be favored. How these dimensions are selected can be discussed further. Moreover, in many cases, multiple abilities are entangled with each other in order to correctly answer a question. For example, “How many apples are there in the image?” as shown in Fig-3 requires both numerical (counting) reasoning and perception (detect apples), which category does this example belong to? 2. The results are usually “winner takes all”. As sho
There are several strengths about this work:
- The vision-language community certainly needs more objective benchmarks for evaluating recent multimodal models.
- The proposed VQA benchmark covers a wide array of abilities (over 20).
- The paper comprehensively tests most recent multimodal models (18 of them).
I have several major concerns about dataset collection and evaluation strategies. > **Dataset Collection and Quality** As the major contribution of this paper is the new VQA benchmark, I find the paper did a **poor job in explaining how the samples are generated, collected, and verified**. For example, how did you select images from existing sources? How did the annotator come up with QA pairs based on the images? How did you verify the correctness/relevance of these samples? From the current
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
