Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark

Ziming Cheng; Binrui Xu; Lisheng Gong; Zuhe Song; Tianshuo Zhou; Shiqi Zhong; Siyu Ren; Mingxiang Chen; Xiangchao Meng; Yuxin Zhang; Yanlin Li; Lei Ren; Wei Chen; Zhiyuan Huang; Mingjie Zhan; Xiaojie Wang; Fangxiang Feng

arXiv:2506.04280·cs.CV·June 6, 2025

Open Access

TL;DR

This paper introduces MMRB, a benchmark for evaluating structured reasoning over multiple images. It reveals significant performance gaps in current models and highlights the need for better multi-image reasoning and reward models.

Contribution

The paper presents MMRB, described as the first benchmark for structured multi-image reasoning. It comprises 92 sub-tasks covering spatial, temporal, and semantic reasoning, CoT-style annotations generated by GPT-4o and refined by human experts, and a scalable evaluation framework.

Findings

01

Open-source MLLMs lag behind commercial models in multi-image reasoning.

02

Current reward models are nearly incapable of multi-image reward ranking.

03

Extensive experiments highlight the need for improved multimodal reasoning capabilities.

Abstract

With enhanced capabilities and widespread applications, Multimodal Large Language Models (MLLMs) are increasingly required to process and reason over multiple images simultaneously. However, existing MLLM benchmarks focus either on single-image visual reasoning or on multi-image understanding tasks with only final-answer evaluation, leaving the reasoning capabilities of MLLMs over multi-image inputs largely underexplored. To address this gap, we introduce the Multimodal Multi-image Reasoning Benchmark (MMRB), the first benchmark designed to evaluate structured visual reasoning across multiple images. MMRB comprises 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. A derivative subset is designed to evaluate multimodal reward models in multi-image scenarios.…

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling

MethodsFocus