MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang

TL;DR
MMEvalPro is a new multimodal benchmark that improves evaluation reliability by reducing biases, incorporating human annotations, and providing a more challenging and trustworthy assessment of large models' multimodal understanding.
Contribution
The paper introduces MMEvalPro, a benchmark with a novel annotation process and rigorous metrics, significantly enhancing the trustworthiness and difficulty of multimodal model evaluations.
Findings
MMEvalPro is more challenging than existing benchmarks.
The best LMM lags human performance by 31.73%.
The benchmark demonstrates improved trustworthiness in evaluation.
Abstract
Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises question…
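To make the triplet idea concrete, here is a minimal sketch of how a stricter, triplet-level score could differ from plain per-question accuracy. It assumes that the "more rigorous metrics" credit a model only when it answers the original question, its perception question, and its knowledge anchor question all correctly; the abstract as shown here is truncated and does not spell out the metric, so treat this as an illustration. The names Triplet, average_accuracy, and strict_triplet_accuracy are hypothetical and do not come from the paper or its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Triplet:
    """Hypothetical container: correctness of one original MCQ and its two anchors."""
    original_correct: bool
    perception_correct: bool
    knowledge_correct: bool


def average_accuracy(triplets: List[Triplet]) -> float:
    """Plain accuracy over all individual questions (three per triplet)."""
    if not triplets:
        return 0.0
    correct = sum(
        int(t.original_correct) + int(t.perception_correct) + int(t.knowledge_correct)
        for t in triplets
    )
    return correct / (3 * len(triplets))


def strict_triplet_accuracy(triplets: List[Triplet]) -> float:
    """Assumed strict metric: a triplet counts only if all three answers are correct."""
    if not triplets:
        return 0.0
    fully_correct = sum(
        1
        for t in triplets
        if t.original_correct and t.perception_correct and t.knowledge_correct
    )
    return fully_correct / len(triplets)


# Made-up results for four triplets, purely for illustration.
results = [
    Triplet(True, True, True),    # consistent: all three correct
    Triplet(True, False, True),   # original right, but perception failed
    Triplet(True, False, False),  # original right with no supporting answers
    Triplet(False, True, True),   # anchors right, original wrong
]

original_only = sum(t.original_correct for t in results) / len(results)
print(f"original-question accuracy: {original_only:.2f}")
print(f"average accuracy:           {average_accuracy(results):.2f}")
print(f"strict triplet accuracy:    {strict_triplet_accuracy(results):.2f}")
```

In this toy data the model answers three of four original questions correctly, yet only one triplet is answered consistently. Under the stated assumption, that gap is the kind of Type-I error (a correct answer without the underlying perception or knowledge) that the benchmark aims to expose.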
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Access Control and Trust · Risk and Safety Analysis
