3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models
Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li, Xiaodong Cai, Zijian Lin, Wen Huang, Hai-Tao Zheng

TL;DR
3ViewSense introduces a novel framework that enhances spatial reasoning in vision-language models by grounding them in orthographic views, effectively bridging the spatial intelligence gap and improving performance on spatial tasks.
Contribution
The paper proposes 3ViewSense, a new approach that uses orthographic projections and a simulate-and-reason mechanism to improve spatial reasoning in vision-language models.
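To make the idea of orthographic projection concrete, here is a minimal sketch of how a block scene reduces to three canonical views. It is an illustration only, not the authors' implementation: the voxel encoding and the names `orthographic_views` and `max_consistent_blocks` are assumptions introduced for this example.

```python
from typing import Dict, Set, Tuple

Block = Tuple[int, int, int]  # (x, y, z) grid position of one unit cube (assumed encoding)
Cell = Tuple[int, int]        # one filled cell in a 2D view


def orthographic_views(blocks: Set[Block]) -> Dict[str, Set[Cell]]:
    """Project unit cubes onto three axis-aligned planes (hypothetical helper)."""
    return {
        "top": {(x, y) for x, y, z in blocks},    # looking down the z axis
        "front": {(x, z) for x, y, z in blocks},  # looking along the y axis
        "side": {(y, z) for x, y, z in blocks},   # looking along the x axis
    }


def max_consistent_blocks(views: Dict[str, Set[Cell]]) -> Set[Block]:
    """Largest block set whose three projections all stay inside the given views."""
    return {
        (x, y, z)
        for (x, y) in views["top"]
        for (x2, z) in views["front"]
        if x2 == x and (y, z) in views["side"]
    }


# A small L-shaped scene with one stacked cube.
blocks = {(0, 0, 0), (0, 0, 1), (1, 0, 0), (0, 1, 0)}
views = orthographic_views(blocks)
upper_bound = max_consistent_blocks(views)

for name, cells in views.items():
    print(name, sorted(cells))
print("true count:", len(blocks), "max consistent with views:", len(upper_bound))
```

Each view alone shows only three filled cells for this four-block scene, which is why a single viewpoint is ambiguous for counting; combining the three views constrains the possible block sets.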
Findings
Significant performance improvements on spatial reasoning benchmarks.
Enhanced stability and consistency in spatial descriptions.
Better handling of occlusion-heavy counting and view alignment.
Abstract
Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce 3ViewSense, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates…
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
