MindCube: Spatial Mental Modeling from Limited Views

Qineng Wang; Baiqiao Yin; Pingyue Zhang; Jianshu Zhang; Kangrui Wang; Zihan Wang; Jieyu Zhang; Keshigeyan Chandrasegaran; Han Liu; Ranjay Krishna; Saining Xie; Jiajun Wu; Li Fei-Fei; Manling Li

arXiv:2506.21458 · cs.AI · April 1, 2026

TL;DR

This paper introduces the MindCube benchmark to evaluate vision-language models' ability to form spatial mental models from limited views, and proposes a 'map-then-reason' approach that raises accuracy from 37.8% to 57.8%.

Contribution

The paper presents a new benchmark and a novel training approach that enhances VLMs' capacity to build and reason over internal spatial representations.

Findings

01. VLMs perform near-random on the MindCube benchmark.

02. The 'map-then-reason' approach improves accuracy from 37.8% to 57.8%.

03. Reinforcement learning further boosts performance to 61.3%.
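
The size of each step in the findings is easier to compare when expressed as both percentage points and relative change. The snippet below is a minimal sketch that uses only the three accuracy figures quoted above; the constant names and the `gain` helper are illustrative and do not come from the paper or its repository.

```python
# Accuracy figures (in percent) as quoted in the findings above.
# All names here are hypothetical, chosen for this illustration only.
BASELINE = 37.8          # accuracy before 'map-then-reason'
MAP_THEN_REASON = 57.8   # accuracy with 'map-then-reason'
WITH_RL = 61.3           # accuracy after adding reinforcement learning


def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


map_gain = gain(BASELINE, MAP_THEN_REASON)   # effect of 'map-then-reason'
rl_gain = gain(MAP_THEN_REASON, WITH_RL)     # additional effect of RL
total_gain = gain(BASELINE, WITH_RL)         # combined effect

if __name__ == "__main__":
    for label, (points, percent) in [
        ("map-then-reason", map_gain),
        ("reinforcement learning", rl_gain),
        ("total", total_gain),
    ]:
        print(f"{label}: +{points} points ({percent}% relative)")
```

Read this way, most of the reported improvement comes from the 'map-then-reason' step, with reinforcement learning adding a smaller increment on top.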

Abstract

Can Vision-Language Models (VLMs) imagine the full scene from just a few views, as humans do? Humans naturally form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our MindCube benchmark, with 21,154 questions across 3,268 images, exposes this critical gap: existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models by representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for "what-if" movements). We then explore three approaches to help approximate spatial mental models in VLMs, focusing on incorporating unseen intermediate views, natural language reasoning chains, and cognitive maps. The significant improvement comes from a synergistic approach, "map-then-reason", that…

Code & Models

Repositories

qinengwang-aiden/mindcube (GitHub)
