Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Xu Zheng; Zihao Dongfang; Lutao Jiang; Boyuan Zheng; Yulong Guo; Zhenquan Zhang; Giuliano Albanese; Runyi Yang; Mengjiao Ma; Zixin Zhang; Chenfei Liao; Dingcheng Zhen; Yuanhuiyi Lyu; Yuqian Fu; Bin Ren; Linfeng Zhang; Danda Pani Paudel; Nicu Sebe; Luc Van Gool; Xuming Hu

arXiv:2510.25760 · cs.CV · November 4, 2025

TL;DR

This survey reviews recent progress in multimodal spatial reasoning with large models, introduces open benchmarks for evaluation, and discusses tasks across 2D, 3D, and embodied AI, highlighting advances in modalities such as audio and egocentric video.

Contribution

The survey provides a comprehensive categorization of multimodal spatial reasoning tasks, introduces open benchmarks, and discusses recent progress in large multimodal reasoning models.

Findings

01 Progress in multimodal large language models (MLLMs) for spatial reasoning

02 Introduction of open benchmarks for evaluation

03 Coverage of spatial tasks across 2D, 3D, and embodied AI

Abstract

Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D…
