Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Bingchen Zhao; Yongshuo Zong; Letian Zhang; Timothy Hospedales

arXiv:2406.12742 · cs.CV · June 19, 2024 · 2 cites

TL;DR

This paper introduces MIRB, a comprehensive benchmark for evaluating vision and language models' ability to understand and reason across multiple images, highlighting current limitations and gaps in multi-image reasoning capabilities.

Contribution

We present MIRB, the first benchmark specifically designed for multi-image understanding in vision and language models, covering perception, knowledge, reasoning, and multi-hop reasoning.

Findings

01 Open-source VLMs approach GPT-4V in single-image tasks.

02 A significant performance gap exists in multi-image reasoning tasks.

03 GPT-4V still struggles with the MIRB benchmark.

Abstract

The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this paper, we introduce the Multi-Image Relational Benchmark (MIRB), designed to evaluate VLMs' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that while open-source VLMs were shown to approach the performance of GPT-4V in single-image tasks, a significant performance…

Code & Models

Repositories

dtennant/mirb_eval (PyTorch, official)

Datasets

VLLMs/MIRB
dataset · 109 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Focus