MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

TL;DR
MMT-Bench is a comprehensive multimodal benchmark with over 31,000 questions designed to evaluate large vision-language models across diverse tasks, highlighting current challenges and guiding future development towards general-purpose multimodal AI.
Contribution
This paper introduces MMT-Bench, the first extensive benchmark covering 32 core tasks and 162 subtasks for evaluating LVLMs in complex multimodal scenarios, addressing limitations of prior benchmarks.
Findings
Current LVLMs struggle with complex multimodal tasks.
MMT-Bench reveals significant gaps in model capabilities.
Evaluation of 30 LVLMs shows challenges in out-of-domain tasks.
Abstract
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering core meta-tasks and subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating…
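Because the benchmark is described as multi-choice questions grouped into subtasks, one natural way to summarize a model's results is accuracy per subtask plus an unweighted mean across subtasks. The following is a minimal sketch under that assumption, not MMT-Bench's official scoring code. The record keys (`subtask`, `answer`, `prediction`), the function names, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def per_subtask_accuracy(records):
    """Return {subtask: accuracy} for multiple-choice records.

    Each record is a dict with hypothetical keys:
      'subtask'    - name of the subtask the question belongs to
      'answer'     - gold option letter, e.g. "A"
      'prediction' - option letter chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        name = record["subtask"]
        total[name] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


def macro_average(scores):
    """Unweighted mean over subtasks, so small subtasks count equally."""
    return sum(scores.values()) / len(scores) if scores else 0.0


# Made-up sample data, purely illustrative.
records = [
    {"subtask": "subtask_a", "answer": "A", "prediction": "a"},
    {"subtask": "subtask_a", "answer": "B", "prediction": "C"},
    {"subtask": "subtask_b", "answer": "D", "prediction": "D"},
]
scores = per_subtask_accuracy(records)
overall = macro_average(scores)
```

Averaging over subtasks rather than over questions is a design choice made here for illustration: it keeps a subtask with many questions from dominating the overall score. Whether the benchmark itself weights results this way is not stated in the text above.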
Taxonomy
Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
