Does Spatial Cognition Emerge in Frontier Models?
Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Kraehenbuehl, Vladlen Koltun

TL;DR
SPACE is a comprehensive benchmark designed to evaluate the spatial cognition abilities of frontier models, revealing that current models perform poorly compared to animals on classic spatial cognition tasks.
Contribution
The paper introduces SPACE, a novel benchmark for assessing spatial cognition in large language and multimodal models, inspired by cognitive science research.
Findings
Frontier models perform near chance level on a number of classic tests of animal cognition
SPACE enables systematic evaluation of spatial reasoning abilities
Benchmark covers large-scale mapping, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory
Abstract
Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition. Code and data are available: https://github.com/apple/ml-space-benchmark
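The phrase "near chance level" can be made concrete with a small sketch. The snippet below is an illustration only, not the evaluation code from the SPACE repository: it assumes a multiple-choice task with a fixed number of answer options, and the helper names (`chance_level`, `binomial_tail`, `is_above_chance`) and the example counts are hypothetical. It computes the exact one-sided binomial probability of scoring at least as well as observed by guessing uniformly at random.

```python
from math import comb


def chance_level(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing (assumed task format)."""
    return 1.0 / num_choices


def binomial_tail(correct: int, total: int, p: float) -> float:
    """Exact P(X >= correct) for X ~ Binomial(total, p)."""
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


def is_above_chance(correct: int, total: int, num_choices: int,
                    alpha: float = 0.05) -> bool:
    """True if the score is unlikely (at level alpha) under pure guessing."""
    p_value = binomial_tail(correct, total, chance_level(num_choices))
    return p_value < alpha


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    for correct in (27, 60):
        verdict = is_above_chance(correct, total=100, num_choices=4)
        print(f"{correct}/100 on a 4-way task, above chance: {verdict}")
```

Under these assumptions, a score only a few points above 25% on 100 four-way questions is not distinguishable from guessing, which is the sense in which a model can be described as performing near chance.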
Peer Reviews
Decision: ICLR 2025 Poster
1. It is interesting to test both VLM and LLM models’ large-scale 2D and 3D spatial reasoning abilities using three scenarios: Ego Image/Video, Bird’s Eye View (BEV) images, and BEV text. It also allows for the assessment of models across different modalities, including textual descriptions, single images, and video content. This design reflects the complexity of real-world spatial understanding and demonstrates the comprehensive nature of the evaluation. 2. This article uses interactive tasks
A. Providing specific details such as the resolution of the videos and images used in the experiments would be better, because models may accept different input image sizes, so model performance can vary significantly with input resolution. If an image is downsampled incorrectly, it may disrupt the original aspect ratio, potentially impairing performance in tasks that require precise spatial reasoning, such as distance and direction estimation. This issue might partially exp
- The proposed benchmarks are very extensive, covering different types of spatial cognition with many tasks and input modalities. - The paper is well written and many details are provided on the evaluation methodology, including detailed prompts and model settings. - The topic investigated in this paper is an important one, with many implications for real-world use of foundation models.
Although the benchmark appears to be well designed, there are a number of previous works that address very similar questions, and the results are not particularly surprising. Both Valmeekam et al. (2023) and Momennejad et al. (2023) have shown that LLMs are very poor at navigation and planning tasks (i.e., large-scale spatial cognition), and studies from both Yamada et al. (2024) and Ivanova et al. (2024) have already shown that LLMs have a limited ability to reason about spatial relations.
* Very timely topic. * It is a quite impressive set of tasks that really engaged with the literature on spatial reasoning. The translation of spatial reasoning tasks to text only is an especially nice contribution, because it allows for the evaluation on text-only models. I think the benchmark will be immediately valuable to the community. * Comprehensive evaluation. Lots of models were used, which really gives us an idea as to how the current landscape of models performs at tasks like these
1. The authors took the best first step in evaluating a capability that humans have (spatial cognition), which is to directly take tasks that have been used in cognitive science to test this capability in humans. However, these are *tests that were designed for humans specifically*, not Large (vision)-Language Models. And I think this would definitely have implications on the results. For example, vision encoders like CLIP are trained on a lot of ImageNet-style object image data and so a model u
Taxonomy
Topics: Geographic Information Systems Studies
Methods: Softmax · Attention Is All You Need
