Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Jihan Yang; Shusheng Yang; Anjali W. Gupta; Rilyn Han; Li Fei-Fei; Saining Xie

arXiv:2412.14171 · cs.CV · July 4, 2025 · 3 cites

TL;DR

This paper introduces VSI-Bench, a new benchmark for evaluating multimodal large language models' visual-spatial reasoning from videos, revealing their current capabilities and limitations in spatial understanding.

Contribution

The paper presents VSI-Bench, a novel video-based benchmark for visual-spatial intelligence, and analyzes how MLLMs process spatial information, highlighting the importance of explicit cognitive map generation.

Findings

01 MLLMs show competitive, though subhuman, visual-spatial intelligence.

02 Explicit cognitive map generation improves spatial reasoning performance.

03 Spatial reasoning remains the primary bottleneck keeping MLLMs from higher benchmark performance.

Abstract

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating…

Code & Models

Repositories

vision-x-nyu/thinking-in-space
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques