MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models
Siwei Wu, Kang Zhu, Yu Bai, Yiming Liang, Yizhi Li, Haoning Wu, J.H. Liu, Ruibo Liu, Xingwei Qu, Xuxin Cheng, Ge Zhang, Wenhao Huang, Chenghua Lin

TL;DR
This paper introduces MMRA, a comprehensive benchmark for evaluating large visual language models' ability to understand and relate multiple images at various levels of detail, revealing current limitations and areas for improvement.
Contribution
The paper presents the MMRA benchmark, which comprises 1,024 samples and 11 subtasks based on ConceptNet, as a new evaluation framework for multi-image relation perception at multiple granularities.
Findings
LVLMs perform better on image-level than entity-level tasks.
Spatial relation understanding remains a significant challenge for LVLMs.
Current LVLMs do not effectively model image sequences during pre-training.
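The image-level versus entity-level comparison in the findings above amounts to grouping per-sample results by granularity and comparing accuracies. The following is a minimal sketch of that aggregation, not the authors' evaluation code. The record fields (`subtask`, `granularity`, `correct`), the subtask labels, and the toy values are all hypothetical and do not come from the MMRA release or its results.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for a list of per-sample result dicts.

    `records` is assumed to hold one dict per benchmark sample with a
    boolean "correct" field; `key` names the field to group by.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Made-up results for illustration only (hypothetical subtask labels).
records = [
    {"subtask": "subtask_a", "granularity": "image", "correct": True},
    {"subtask": "subtask_a", "granularity": "image", "correct": True},
    {"subtask": "subtask_b", "granularity": "entity", "correct": False},
    {"subtask": "subtask_b", "granularity": "entity", "correct": True},
]

by_granularity = accuracy_by_group(records, "granularity")
by_subtask = accuracy_by_group(records, "subtask")
print(by_granularity)
print(by_subtask)
```

Grouping by a field name keeps the same helper usable for both the per-granularity and the per-subtask breakdowns.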
Abstract
Given the remarkable success that large visual language models (LVLMs) have achieved in image perception tasks, the endeavor to make LVLMs perceive the world like humans is drawing increasing attention. Current multi-modal benchmarks primarily focus on facts or specific topic-related knowledge contained within individual images. However, they often overlook the associative relations between multiple images, which require the identification and analysis of similarities among entities or content present in different images. Therefore, we propose the multi-image relation association task and a meticulously curated Multi-granularity Multi-image Relational Association (MMRA) benchmark, comprising 1,024 samples. In order to systematically and comprehensively evaluate current LVLMs, we establish an associational relation system among images that contains 11 subtasks (e.g., UsageSimilarity,…
Taxonomy
Topics: Multimodal Machine Learning Applications
