SpatialMosaic: A Multiview VLM Dataset for Partial Visibility
Kanghee Lee, Injae Lee, Minseok Kwak, Jungi Hong, Kwonyoung Ryu, Jaesik Park

TL;DR
SpatialMosaic is a large-scale dataset and benchmark designed to improve multi-view spatial reasoning in multimodal models, addressing challenges like partial visibility and occlusion in real-world scenes.
Contribution
We introduce a scalable data generation pipeline, a comprehensive dataset with 2 million QA pairs, and a new benchmark for evaluating multi-view spatial reasoning in diverse scenarios.
Findings
The dataset improves spatial reasoning performance under challenging conditions such as partial visibility and occlusion.
The proposed baseline enhances multi-view spatial understanding.
Extensive experiments validate the dataset's effectiveness.
Abstract
The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling MLLMs to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning…
