SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
Zishan Liu, Ruoxi Zang, Yanglin Zhang, Wei Liu, Yin Zhang, Jian Yao, Jiayin Zheng, Zhengzhe Liu

TL;DR
SpatialForge introduces a scalable pipeline to generate large-scale 3D spatial reasoning data from 2D images, significantly enhancing vision-language models' understanding of spatial relationships.
Contribution
We propose a novel data synthesis pipeline that converts in-the-wild 2D images into structured spatial reasoning supervision, producing SpatialForge-10M, a dataset of 10 million QA pairs.
Findings
Training on SpatialForge-10M improves spatial reasoning in VLMs.
The dataset covers depth, layout, and viewpoint-dependent reasoning.
Experiments show significant performance gains on spatial benchmarks.
Abstract
Large Vision-Language Models (VLMs) have recently demonstrated exceptional semantic understanding, yet they consistently struggle with spatial reasoning, often failing at fundamental geometric tasks such as depth ordering and precise coordinate grounding. Recent efforts introduce spatial supervision from scene-centric datasets (e.g., multi-view scans or indoor video), but they are constrained by the limited number of underlying scenes. As a result, the scale and diversity of such data remain far smaller than those of web-scale 2D image collections. To address this limitation, we propose SpatialForge, a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision. Our approach decomposes spatial reasoning into perception and relation, and constructs structured supervision signals covering depth, layout, and…
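To make the idea of turning per-image perception into relational supervision concrete, here is a minimal sketch of how a single depth-ordering QA pair might be built from two detected objects. It is an illustration under stated assumptions, not the SpatialForge pipeline itself: the `DetectedObject` type, the `depth_m` field, the `depth_order_qa` function, the `min_gap` threshold, and the question template are all hypothetical, and the depth values are assumed to come from some upstream estimator that is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DetectedObject:
    # Hypothetical perception output: an object label plus an estimated
    # distance from the camera in meters (assumed to come from an
    # upstream depth estimator, which is not part of this sketch).
    name: str
    depth_m: float


def depth_order_qa(a: DetectedObject, b: DetectedObject,
                   min_gap: float = 0.1) -> Optional[dict]:
    """Build one depth-ordering QA pair, or return None if ambiguous.

    Pairs whose estimated depths differ by less than `min_gap` are
    skipped, since a noisy depth estimate could flip the answer.
    """
    if abs(a.depth_m - b.depth_m) < min_gap:
        return None
    closer = a if a.depth_m < b.depth_m else b
    return {
        "question": (
            f"Which is closer to the camera, the {a.name} or the {b.name}?"
        ),
        "answer": f"the {closer.name}",
    }


if __name__ == "__main__":
    chair = DetectedObject("chair", 1.2)
    lamp = DetectedObject("lamp", 3.5)
    print(depth_order_qa(chair, lamp))
```

Skipping near-ties is one plausible way to keep label noise out of synthesized data; whether the actual pipeline filters in this way is not stated in the abstract.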
