MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Jinsheng Huang; Liang Chen; Taian Guo; Fu Zeng; Yusheng Zhao; Bohan Wu; Ye Yuan; Haozhe Zhao; Zhihui Guo; Yichi Zhang; Jingyang Yuan; Wei Ju; Luchen Liu; Tianyu Liu; Baobao Chang; Ming Zhang

arXiv:2407.00468·cs.CV·February 28, 2025

TL;DR

MMEvalPro is a new multimodal benchmark that improves evaluation reliability by reducing biases, incorporating human annotations, and providing a more challenging and trustworthy assessment of large models' multimodal understanding.

Contribution

The paper introduces MMEvalPro, a benchmark with a novel annotation process and rigorous metrics, significantly enhancing the trustworthiness and difficulty of multimodal model evaluations.

Findings

01. MMEvalPro is more challenging than existing benchmarks.

02. The best LMM lags human performance by 31.73%.

03. The benchmark demonstrates improved trustworthiness in evaluation.

Abstract

Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises $2,138$ question…
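
The abstract is cut off before it defines the "more rigorous metrics", so the following is only a minimal sketch of how a triplet-level score could work. It assumes that a triplet counts as solved only when the original question, its perception question, and its knowledge anchor question are all answered correctly. The names `TripletResult`, `average_accuracy`, and `strict_triplet_accuracy` are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TripletResult:
    """Correctness of one model on one question triplet (hypothetical schema)."""
    origin_correct: bool      # original benchmark question
    perception_correct: bool  # human-written perception question
    knowledge_correct: bool   # human-written knowledge anchor question


def average_accuracy(results: Sequence[TripletResult]) -> float:
    """Plain per-question accuracy over all questions in all triplets."""
    if not results:
        return 0.0
    correct = sum(
        r.origin_correct + r.perception_correct + r.knowledge_correct
        for r in results
    )
    return correct / (3 * len(results))


def strict_triplet_accuracy(results: Sequence[TripletResult]) -> float:
    """Fraction of triplets in which all three questions are correct.

    Assumption: this mirrors the idea of a stricter, triplet-level metric;
    the exact definition used by MMEvalPro is not given in the text above.
    """
    if not results:
        return 0.0
    solved = sum(
        r.origin_correct and r.perception_correct and r.knowledge_correct
        for r in results
    )
    return solved / len(results)
```

Under this assumption, a model that answers an original question correctly but fails its perception or knowledge anchor question gets no credit for that triplet. That is one way a benchmark could penalize answers reached through guessing or textual shortcuts.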

Code & Models

Repositories

chenllliang/mmevalpro
Official

Datasets

MM-Diagnose/MMEvalPro
dataset · 59 downloads

Videos

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Access Control and Trust · Risk and Safety Analysis