MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Kaining Ying; Fanqing Meng; Jin Wang; Zhiqian Li; Han Lin; Yue Yang; Hao Zhang; Wenbo Zhang; Yuqi Lin; Shuo Liu; Jiayi Lei; Quanfeng Lu; Runjian Chen; Peng Xu; Renrui Zhang; Haozhe Zhang; Peng Gao; Yali Wang; Yu Qiao; Ping Luo; Kaipeng Zhang; Wenqi Shao

arXiv:2404.16006 · cs.CV · April 25, 2024 · 6 cites

TL;DR

MMT-Bench is a comprehensive multimodal benchmark with over 31,000 questions designed to evaluate large vision-language models across diverse tasks, highlighting current challenges and guiding future development towards general-purpose multimodal AI.

Contribution

This paper introduces MMT-Bench, the first extensive benchmark covering 32 core meta-tasks and 162 subtasks for evaluating LVLMs in complex multimodal scenarios, addressing the limited task coverage of prior benchmarks.

Findings

01. Current LVLMs struggle with complex multimodal tasks.
02. MMT-Bench reveals significant gaps in model capabilities.
03. Evaluation of 30 LVLMs shows challenges in out-of-domain tasks.

Abstract

Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks that test rudimentary capabilities, and so fall short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating…
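
Since the benchmark consists of multi-choice questions grouped by task, one plausible way to score a model is accuracy per task plus an overall figure. The following is a minimal sketch of that idea, not the authors' evaluation code: the record layout `(task, predicted, gold)`, the function name `accuracy_by_task`, and the task names in the usage example are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Compute per-task and overall accuracy for multi-choice answers.

    `records` is an iterable of (task, predicted, gold) tuples, where
    `predicted` and `gold` are option letters such as "A" or "b".
    This record layout is hypothetical, not the benchmark's own format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if predicted.strip().upper() == gold.strip().upper():
            correct[task] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    n_questions = sum(total.values())
    overall = sum(correct.values()) / n_questions if n_questions else 0.0
    return per_task, overall


if __name__ == "__main__":
    # Hypothetical records; the task names are placeholders.
    demo = [
        ("visual_recognition", "A", "A"),
        ("visual_recognition", "b", "B"),
        ("localization", "C", "D"),
    ]
    per_task, overall = accuracy_by_task(demo)
    print(per_task, overall)
```

Reporting the per-task dictionary alongside the overall number keeps weak tasks visible, which a single averaged score would hide.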

Code & Models

Repositories

yangyue5114/DME
pytorch

Taxonomy

Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques