Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning

Haoyi Jiang; Liu Liu; Xinjie Wang; Yonghao He; Wei Sui; Zhizhong Su; Wenyu Liu; Xinggang Wang

arXiv:2602.21186·cs.CV·February 25, 2026

TL;DR

Spa3R introduces a self-supervised framework that learns view-invariant 3D spatial representations from multi-view images, enhancing 3D visual reasoning in vision-language models without explicit 3D supervision.

Contribution

The paper proposes the Predictive Spatial Field Modeling paradigm, which enables scalable 3D understanding directly from 2D images, and integrates it into VLMs to improve spatial reasoning.

Findings

1. Achieves 58.6% accuracy on a 3D VQA benchmark

2. Significantly outperforms prior methods

3. Demonstrates scalable 3D reasoning from 2D images

Abstract

While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space, a cornerstone of spatial intelligence, remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the…

Taxonomy

Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization