MonoSR: Open-Vocabulary Spatial Reasoning from Monocular Images
Qirui Wang, Jingyi He, Yining Pan, Si Yong Yeo, Xulei Yang, Shijie Li

TL;DR
MonoSR introduces a large-scale dataset for monocular spatial reasoning across diverse environments, highlighting current model limitations and guiding future research for real-world applications.
Contribution
The paper presents MonoSR, a comprehensive dataset for monocular spatial reasoning in varied scenarios, and evaluates vision-language models to identify limitations and inform future model design.
Findings
Vision-language models struggle with monocular spatial reasoning tasks.
Auxiliary information improves model performance.
MonoSR enables research on open-world monocular spatial reasoning.
Abstract
Spatial reasoning (SR), the ability to infer 3D spatial information from 2D inputs, is essential for real-world applications such as embodied AI and autonomous driving. However, existing research primarily focuses on indoor environments and typically relies on multi-view observations, which limits its generalizability to outdoor scenarios and its applicability to monocular images, the most common real-world setting. In this work, we propose MonoSR, a large-scale monocular spatial reasoning dataset that spans diverse scenarios including indoor, outdoor, and object-centric settings, and supports multiple question types. MonoSR provides a path toward open-world monocular spatial reasoning. Beyond introducing the dataset, we evaluate advanced vision-language models to reveal their limitations on this challenging task. We further analyze whether auxiliary information is…
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation
