Can Multimodal Large Language Models Understand Spatial Relations?
Jingping Liu, Ziyan Liu, Zhedong Cen, Yan Zhou, Yinan Zou, Weiyan Zhang, Haiyun Jiang, Tong Ruan

TL;DR
This paper introduces SpatialMQA, a new benchmark for spatial relation reasoning in multimodal large language models (MLLMs). The best current model reaches only 48.14% accuracy against 98.40% for humans, and the paper's analysis points to directions for future research.
Contribution
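The paper presents SpatialMQA, a human-annotated benchmark of 5,392 samples built on COCO2017 for evaluating spatial relation reasoning in MLLMs. It addresses three limitations of previous benchmarks: reliance on bounding boxes, neglect of perspective substitutions, and questions that can be answered from the model's prior knowledge without understanding the image.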
Findings
Current SOTA MLLMs achieve only 48.14% accuracy on SpatialMQA.
Humans achieve 98.40% accuracy, highlighting the gap in spatial understanding.
Extensive analysis suggests future research directions in spatial reasoning for MLLMs.
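To make the reported figures concrete, the following minimal sketch shows how an accuracy score of this kind is typically computed and how large the model-to-human gap is. The answer labels and the `accuracy` helper are hypothetical and used only for illustration; they are not taken from the paper or its evaluation code.

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that exactly match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: four questions with made-up spatial-relation labels.
toy_predictions = ["left", "right", "above", "left"]
toy_answers = ["left", "left", "above", "below"]
toy_accuracy = accuracy(toy_predictions, toy_answers)  # 2 of 4 correct

# Figures reported in the summary above (percent accuracy on SpatialMQA).
MODEL_ACC = 48.14
HUMAN_ACC = 98.40

# Gap between human and best-model accuracy, in percentage points.
gap_points = round(HUMAN_ACC - MODEL_ACC, 2)

print(f"toy accuracy: {toy_accuracy:.2f}%")
print(f"human-model gap: {gap_points:.2f} percentage points")
```

On the reported numbers the gap is about 50 percentage points, meaning the best model answers slightly fewer than half of the questions that humans answer correctly almost every time.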
Abstract
Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues like relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model's prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%.…
