Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, Ranjay Krishna

TL;DR
This paper introduces STARE, a benchmark for evaluating multimodal models on complex visual spatial reasoning tasks, revealing current models' limitations in multi-step visual simulation compared to humans.
Contribution
The paper presents STARE, a comprehensive benchmark for assessing multimodal models on visual spatial reasoning, highlighting their struggles with multi-step visual simulations.
Findings
Models perform well on simple 2D transformations.
Models struggle with complex 3D and multi-step tasks.
Humans outperform models significantly in accuracy and speed.
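To make the contrast between a single 2D transformation and a multi-step one concrete, here is a minimal illustrative sketch; it is not the benchmark's code. The helper names (`rotate`, `translate`, `simulate`, `rounded`) and the example triangle are hypothetical, chosen only to show what it means to carry a shape through several transformations while tracking every intermediate state.

```python
import math

def rotate(points, degrees):
    """Rotate 2D points counter-clockwise about the origin."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(x * c - y * s, x * s + y * c) for x, y in points]

def translate(points, dx, dy):
    """Shift 2D points by (dx, dy)."""
    return [(x + dx, y + dy) for x, y in points]

def simulate(points, steps):
    """Apply each (function, args) step in order, keeping every intermediate state."""
    states = [points]
    for fn, args in steps:
        points = fn(points, *args)
        states.append(points)
    return states

def rounded(points, ndigits=6):
    """Round coordinates to hide floating-point noise; adding 0.0 normalises -0.0."""
    return [(round(x, ndigits) + 0.0, round(y, ndigits) + 0.0) for x, y in points]

# Hypothetical example shape: a right triangle with one vertex at the origin.
triangle = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]

# A three-step task: rotate 90 degrees, shift right by 3, rotate 90 degrees again.
steps = [(rotate, (90,)), (translate, (3, 0)), (rotate, (90,))]

states = simulate(triangle, steps)
final = rounded(states[-1])
print(final)  # final vertices: (0, 3), (-2, 3), (0, 2)
```

A single-step question only needs `states[1]`; a multi-step question depends on getting every earlier state right, which is one plausible reading of why errors compound on longer tasks.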
Abstract
Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel…
Peer Reviews
Decision·ICLR 2026 Poster
The paper is well-motivated and well-written, with a clear structure and sufficient detail about the execution of both dataset construction and experiments, which makes it easy to replicate and build upon. The benchmark covers the spatial reasoning tasks the authors target, with multiple difficulty levels and checks for different types of reasoning in models. The evaluations are robust, including settings with and without visual simulations and step-by-step reasoning. The findings are presented
I didn't notice major issues with this paper, but the few minor points I saw are as follows: There is a problem with Figure 1: the question image does not match the step images. The text mentions "Fig. 10" on line 196, but I assume the correct reference is to Figures 9 and 10. Figure 10 also has missing pieces. Line 323: "Notably, o3 seems to better at leveraging visual simulations" has a grammatical error.
The paper presents the problem of composite reasoning through geometric transformation tasks clearly and concisely. I find these particular strengths of the paper interesting: 1) Extensive task definition - The variety of tasks covering: (i) integrated spatial reasoning over 2D and 3D, (ii) Foundational geometric transformations, and (iii) real-world spatial reasoning encompasses a majority of reasoning behaviour abilities for MLLMs. 2) Behavioural consistency with prior works - The paper demon
I have only one major weakness/criticism to identify for the authors, although this has been highlighted in the limitations: 1) More comprehensive answering for models could be incorporated into the work, i.e. moving beyond multiple-choice and binary question answering towards one-word or two-word answers, to also analyse the model's probabilistic tendency to give the true answer as a response, along with intermediate reasoning, rather than just choosing an option.
1. Novel and Well-Motivated Benchmark: The paper introduces STARE, a new benchmark that addresses a critical area of multimodal AI: multi-step spatial reasoning. It provides a structured framework for diagnosing the spatial cognition capabilities of models. 2. Thorough Experimental Evaluation: The study is grounded in a comprehensive evaluation of a wide range of contemporary models, including a crucial human performance baseline. The analysis offers some valuable insights into model failure mo
1. Potential Bias in the "Simulation" Paradigm: The paper's core claim is about evaluating "visual simulations." However, its primary method for this (providing intermediate visual steps) tests a model's ability to interpret a given sequence of images, not its ability to generate that sequence internally. A human performing mental rotation isn't shown snapshots; their mind creates the intermediate frames. This is a subtle but critical distinction. The benchmark is, therefore, a stronger test o
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
