Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang

TL;DR
Spa3R introduces a self-supervised framework that learns view-invariant 3D spatial representations from multi-view images, enhancing 3D visual reasoning in vision-language models without explicit 3D supervision.
Contribution
Spa3R proposes the Predictive Spatial Field Modeling paradigm, which enables scalable 3D understanding directly from 2D images, and integrates the learned representation into VLMs for improved spatial reasoning.
Findings
Achieves 58.6% accuracy on a 3D VQA benchmark
Outperforms prior methods significantly
Demonstrates scalable 3D reasoning from 2D images
Abstract
While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space, a cornerstone of spatial intelligence, remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the…
Taxonomy
Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization
