VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations
Fucai Ke, Zhixi Cai, Boying Li, Long Chen, Beibei Lin, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Hamid Rezatofighi

TL;DR
This paper introduces VIEW2SPACE, a new benchmark for multi-view visual reasoning using simulated 3D scenes, revealing current models' limitations and proposing methods to improve reasoning across sparse views.
Contribution
The paper presents a scalable simulation-based benchmark for multi-view reasoning, evaluates state-of-the-art models on it to expose where they struggle, and proposes Grounded Chain-of-Thought methods for improvement.
Findings
State-of-the-art models perform only marginally better than random guessing on multi-view reasoning.
Grounded Chain-of-Thought improves performance on moderate-difficulty questions.
Scaling models benefits geometric perception but not deep reasoning across sparse views.
Abstract
Multi-view visual reasoning is essential for intelligent systems that must understand complex environments from sparse and discrete viewpoints, yet existing research has largely focused on single-image or temporally dense video settings. In real-world scenarios, reasoning across views requires integrating partial observations without explicit guidance, while collecting large-scale multi-view data with accurate geometric and semantic annotations remains challenging. To address this gap, we leverage physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable data generation that remains transferable to real-world settings. Based on this engine, we introduce VIEW2SPACE, a multi-dimensional benchmark for sparse multi-view reasoning, together with a scalable, disjoint training split supporting millions of grounded…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
