Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective
Qiyao Xue, Weichen Liu, Shiqi Wang, Haoming Wang, Yuyang Wu, Wei Gao

TL;DR
This paper introduces ReMindView-Bench, a benchmark for evaluating multi-view spatial reasoning in vision-language models (VLMs). It reveals consistent failures in cross-view alignment and perspective-taking and offers insight into the models' reasoning process.
Contribution
It presents a cognitively grounded benchmark, together with analysis methods for diagnosing the limitations of current VLMs in multi-view spatial reasoning.
Findings
VLMs struggle with cross-view alignment and perspective-taking.
Performance drops significantly when integrating information across views.
Analysis reveals progressive loss of task-relevant information during reasoning.
Abstract
Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We attribute this gap to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. To address this, we present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints. ReMindView-Bench systematically varies viewpoint spatial pattern and query type to probe key factors of spatial cognition. Evaluations of 15 current VLMs reveal consistent failures in cross-view alignment and perspective-taking in multi-view spatial reasoning, motivating deeper…
