MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models

Siwei Wu; Kang Zhu; Yu Bai; Yiming Liang; Yizhi Li; Haoning Wu; J.H. Liu; Ruibo Liu; Xingwei Qu; Xuxin Cheng; Ge Zhang; Wenhao Huang; Chenghua Lin

arXiv:2407.17379·cs.CV·August 7, 2024


TL;DR

This paper introduces MMRA, a comprehensive benchmark for evaluating large visual language models' ability to understand and relate multiple images at various levels of detail, revealing current limitations and areas for improvement.

Contribution

The paper presents MMRA, a novel evaluation benchmark of 1,024 samples across 11 subtasks based on ConceptNet, focusing on multi-image relation perception at multiple granularities.

Findings

01. LVLMs perform better on image-level than entity-level tasks.

02. Spatial relation understanding remains a significant challenge for LVLMs.

03. Current LVLMs do not effectively model image sequences during pre-training.
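
The image-level versus entity-level comparison in the findings comes down to grouping per-sample correctness by granularity (or by subtask) and averaging. The following is a minimal sketch of that aggregation, not the authors' evaluation code. The record fields (`subtask`, `granularity`, `answer`, `prediction`), the subtask names, and the sample values are all hypothetical placeholders and do not reflect the actual MMRA schema or results.

```python
from collections import defaultdict

# Hypothetical records: field names, subtask names, and values are
# illustrative only and are NOT taken from the MMRA dataset.
records = [
    {"subtask": "ExampleEntityTask", "granularity": "entity", "answer": "A", "prediction": "A"},
    {"subtask": "ExampleEntityTask", "granularity": "entity", "answer": "B", "prediction": "C"},
    {"subtask": "ExampleImageTask", "granularity": "image", "answer": "D", "prediction": "D"},
    {"subtask": "ExampleImageTask", "granularity": "image", "answer": "A", "prediction": "A"},
]


def accuracy_by(rows, key):
    """Return {group: fraction of rows whose prediction equals the answer}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row[key]] += 1
        correct[row[key]] += int(row["prediction"] == row["answer"])
    return {group: correct[group] / total[group] for group in total}


# Aggregate once per granularity and once per subtask.
by_granularity = accuracy_by(records, "granularity")
by_subtask = accuracy_by(records, "subtask")

print(by_granularity)
print(by_subtask)
```

Reporting both views side by side is one way to see whether a gap between granularities is spread across subtasks or driven by a single one.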

Abstract

Given the remarkable success that large visual language models (LVLMs) have achieved in image perception tasks, the endeavor to make LVLMs perceive the world like humans is drawing increasing attention. Current multi-modal benchmarks primarily focus on facts or specific topic-related knowledge contained within individual images. However, they often overlook the associative relations between multiple images, which require the identification and analysis of similarities among entities or content present in different images. Therefore, we propose the multi-image relation association task and a meticulously curated Multi-granularity Multi-image Relational Association (MMRA) benchmark, comprising 1,024 samples. In order to systematically and comprehensively evaluate current LVLMs, we establish an associational relation system among images that contains 11 subtasks (e.g., UsageSimilarity,…


Code & Models

Repositories

wusiwei0410/mmra
Official

Datasets

m-a-p/MMRA
dataset · 53 dl

Videos

MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Focus