MIBench: Evaluating Multimodal Large Language Models over Multiple Images
Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu

TL;DR
MIBench is a comprehensive benchmark designed to evaluate multimodal large language models' abilities in multi-image scenarios, revealing current models' limitations in fine-grained perception and reasoning with multiple images.
Contribution
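This paper introduces MIBench, a new benchmark with 13 tasks and 13K annotated samples that assesses MLLMs' multi-image capabilities across three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC). It addresses a gap left by existing benchmarks, which mostly focus on single-image input or cover multiple images with limited evaluation dimensions and samples.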
Findings
Current MLLMs perform well on single-image tasks but struggle with multi-image inputs.
Models show limited fine-grained perception and reasoning in multi-image scenarios.
Benchmark results highlight significant room for improvement in multi-image understanding.
Abstract
Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs when handling realistic multiple images underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. In this paper, we propose a new benchmark, MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS, we extract…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training · Focus
