SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

Zishan Liu; Ruoxi Zang; Yanglin Zhang; Wei Liu; Yin Zhang; Jian Yao; Jiayin Zheng; Zhengzhe Liu

arXiv:2605.11462 · cs.CV · May 13, 2026

TL;DR

SpatialForge is a scalable pipeline that generates 3D spatial reasoning data from 2D images at scale, significantly improving vision-language models' understanding of spatial relationships.

Contribution

We propose a novel data synthesis pipeline that converts in-the-wild 2D images into structured spatial reasoning supervision, yielding SpatialForge-10M, a dataset of 10 million QA pairs.

Findings

01

Training on SpatialForge-10M improves spatial reasoning in VLMs.

02

The dataset covers depth, layout, and viewpoint-dependent reasoning.

03

Experiments show significant performance gains on spatial benchmarks.

Abstract

Recent advancements in Large Vision-Language Models (VLMs) have demonstrated exceptional semantic understanding, yet these models consistently struggle with spatial reasoning, often failing at fundamental geometric tasks such as depth ordering and precise coordinate grounding. Recent efforts introduce spatial supervision from scene-centric datasets (e.g., multi-view scans or indoor video), but are constrained by the limited number of underlying scenes. As a result, the scale and diversity of such data remain significantly smaller than those of web-scale 2D image collections. To address this limitation, we propose SpatialForge, a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision. Our approach decomposes spatial reasoning into perception and relation, and constructs structured supervision signals covering depth, layout, and…
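
To make the idea of turning a single 2D image into spatial supervision concrete, here is a minimal sketch of one depth-ordering QA pair being generated. It is an illustration under stated assumptions, not the authors' pipeline: the `DetectedObject` record, the `depth_order_qa` helper, the `min_gap_m` threshold, and the sample depth values are all hypothetical, and the per-object depths are assumed to come from some upstream monocular depth estimator and object detector.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class DetectedObject:
    """Hypothetical record: an object label plus depth samples (metres) inside its mask."""
    label: str
    depths_m: list[float]


def depth_order_qa(a: DetectedObject, b: DetectedObject, min_gap_m: float = 0.25):
    """Return a (question, answer) pair, or None when the ordering is ambiguous."""
    # The median is used so a few noisy depth samples do not flip the ordering.
    da, db = median(a.depths_m), median(b.depths_m)
    # Skip pairs whose depths are too close to label reliably (assumed threshold).
    if abs(da - db) < min_gap_m:
        return None
    closer = a if da < db else b
    question = f"Which is closer to the camera, the {a.label} or the {b.label}?"
    return question, f"The {closer.label}."


# Made-up depth samples for illustration only.
chair = DetectedObject("chair", [1.9, 2.0, 2.1])
lamp = DetectedObject("lamp", [3.4, 3.5, 3.7])
qa = depth_order_qa(chair, lamp)
print(qa)
```

Dropping ambiguous pairs rather than guessing is a design choice in this sketch: it trades dataset size for label reliability, which matters when the depth values are themselves estimated.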