Can Multimodal Large Language Models Understand Spatial Relations?

Jingping Liu; Ziyan Liu; Zhedong Cen; Yan Zhou; Yinan Zou; Weiyan Zhang; Haiyun Jiang; Tong Ruan

arXiv:2505.19015 · cs.CV · August 11, 2025


TL;DR

This paper introduces SpatialMQA, a human-annotated benchmark for spatial relation reasoning in multimodal large language models (MLLMs). The best model evaluated reaches 48.14% accuracy against 98.40% for humans, and the authors use this gap to suggest directions for future research.

Contribution

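The paper presents SpatialMQA, a human-annotated benchmark of 5,392 samples built on COCO2017 for evaluating spatial relation reasoning in MLLMs. It targets three weaknesses of earlier benchmarks: reliance on bounding boxes, neglect of perspective substitutions, and questions that can be answered from the model's prior knowledge without looking at the image.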

Findings

01. The best-performing MLLM evaluated achieves only 48.14% accuracy on SpatialMQA.

02. Humans achieve 98.40% accuracy on the same benchmark, a gap of roughly 50 percentage points.

03. The authors' analysis of the results suggests future research directions in spatial reasoning for MLLMs.

Abstract

Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues like relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model's prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%.…
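The accuracy figures quoted above can be made concrete with a minimal sketch of how exact-match accuracy on a multiple-choice benchmark is typically computed. This is an illustration under stated assumptions, not the authors' evaluation script: the function name `accuracy`, the case-insensitive matching rule, and the toy answer labels are all hypothetical and do not come from SpatialMQA. Only the two reported percentages are taken from the abstract.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predictions that match the gold answers.

    Matching is case-insensitive and ignores surrounding whitespace;
    this normalization is an assumption, not the paper's stated rule.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set of answers")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data for illustration only (not drawn from SpatialMQA).
gold = ["left of", "above", "in front of", "right of"]
pred = ["left of", "below", "In front of", "left of"]
toy_accuracy = accuracy(pred, gold)  # 2 of the 4 predictions match

# Gap between the two accuracies reported in the abstract,
# expressed in percentage points.
human_acc = 98.40
best_model_acc = 48.14
gap_points = round(human_acc - best_model_acc, 2)
```

On the reported figures, the gap works out to about 50.26 percentage points, so the best model evaluated answers roughly half as many questions correctly as human annotators do.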


Code & Models

Repositories

ziyan-xiaoyu/spatialmqa
PyTorch · Official

Datasets

liuziyan/SpatialMQA
dataset · 461 downloads

Videos

Can Multimodal Large Language Models Understand Spatial Relations?
