MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Haowei Liu; Xi Zhang; Haiyang Xu; Yaya Shi; Chaoya Jiang; Ming Yan; Ji Zhang; Fei Huang; Chunfeng Yuan; Bing Li; Weiming Hu

arXiv:2407.15272 · cs.CV · October 10, 2024 · 1 citation

Open Access · 1 dataset

TL;DR

MIBench is a comprehensive benchmark designed to evaluate multimodal large language models' abilities in multi-image scenarios, revealing current models' limitations in fine-grained perception and reasoning with multiple images.

Contribution

This paper introduces MIBench, a new benchmark with 13 tasks and 13K samples to assess MLLMs' multi-image capabilities across three scenarios, filling a gap in existing evaluations.

Findings

1. Current MLLMs perform well on single-image tasks but struggle with multi-image inputs.

2. Models show limited fine-grained perception and reasoning in multi-image scenarios.

3. Benchmark results highlight significant room for improvement in multi-image understanding.

Abstract

Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs on realistic multi-image inputs underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. In this paper, we propose a new benchmark, MIBench, to comprehensively evaluate the fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS, we extract…
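
To make the three-scenario breakdown concrete, here is a minimal sketch of how per-scenario accuracy could be aggregated over multi-image samples. Only the scenario abbreviations (MII, MKS, MIC) come from the abstract. The `Sample` record layout, the task names, and the case-insensitive exact-match scoring are assumptions made for illustration and may not match the released dataset or the paper's evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Scenario abbreviations as named in the abstract.
SCENARIOS = ("MII", "MKS", "MIC")


@dataclass
class Sample:
    """Hypothetical record layout; the real dataset schema may differ."""
    scenario: str           # one of SCENARIOS
    task: str               # illustrative task name
    image_paths: List[str]  # images shown to the model for this sample
    question: str
    answer: str             # ground-truth answer


def accuracy_by_scenario(samples: Sequence[Sample],
                         predictions: Sequence[str]) -> Dict[str, float]:
    """Return accuracy per scenario.

    Scoring is case-insensitive exact match after stripping whitespace,
    which is an assumption of this sketch.
    """
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")

    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for sample, prediction in zip(samples, predictions):
        if sample.scenario not in SCENARIOS:
            raise ValueError(f"unknown scenario: {sample.scenario!r}")
        total[sample.scenario] += 1
        if prediction.strip().lower() == sample.answer.strip().lower():
            correct[sample.scenario] += 1

    # Only scenarios that actually appear in `samples` are reported.
    return {name: correct[name] / total[name] for name in total}
```

Reporting scores per scenario rather than as a single average keeps a weakness in one scenario from being hidden by strength in another.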

Code & Models

Datasets

StarBottle/MIBench
dataset · 175 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Focus