MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

Weihao Yu; Zhengyuan Yang; Lingfeng Ren; Linjie Li; Jianfeng Wang; Kevin Lin; Chung-Ching Lin; Zicheng Liu; Lijuan Wang; Xinchao Wang

arXiv:2408.00765 · cs.CV · December 3, 2024 · 3 cites

TL;DR

MM-Vet v2 introduces a new benchmark for large multimodal models that evaluates their ability to understand interleaved image-text sequences, expanding beyond previous single image-text pair assessments.

Contribution

The paper presents MM-Vet v2, a new benchmark including a novel 'image-text sequence understanding' capability and an expanded evaluation set, enhancing the assessment of multimodal models.

Findings

01. Claude 3.5 Sonnet scores highest, at 71.8.
02. GPT-4o scores 71.0, slightly below Claude 3.5 Sonnet.
03. The open-weight InternVL2-Llama3-76B achieves 68.4.

Abstract

MM-Vet, with open-ended vision-language questions designed to evaluate integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs and lacks the interleaved image and text sequences prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which includes a new VL capability called "image-text sequence understanding" that evaluates models' ability to process VL sequences. Furthermore, we maintain the high quality of evaluation samples while further expanding the evaluation set size. Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8,…

Code & Models

Repositories

yuweihao/mm-vet
PyTorch · Official

Datasets

whyu/mm-vet-v2
dataset · 434 dl

Taxonomy

Topics: Speech and dialogue systems