Large Spatial Model: End-to-end Unposed Images to Semantic 3D
Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, Boris Ivanovic, Marco Pavone, Yue Wang

TL;DR
The paper introduces the Large Spatial Model (LSM), a Transformer-based system that reconstructs and understands 3D scenes directly from unposed images, integrating geometry, appearance, and semantics in real time.
Contribution
LSM is the first end-to-end model to process unposed RGB images into semantic 3D representations using a unified Transformer architecture.
Findings
Achieves real-time semantic 3D reconstruction from unposed images.
Unifies multiple 3D vision tasks in a single feed-forward model.
Incorporates language-driven scene manipulation with a pre-trained segmentation model.
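To make the language-driven finding concrete, here is a minimal sketch of one common way to select parts of a scene from a text query: compare a per-point semantic feature against a query embedding by cosine similarity and keep the points above a threshold. This is an illustration under stated assumptions, not LSM's implementation. The function names, the toy 3-dimensional features, and the 0.8 threshold are all hypothetical; in a real system the features and the query embedding would come from learned models.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector matches nothing
    return dot / (norm_a * norm_b)


def select_points(point_features, query_embedding, threshold=0.8):
    """Return indices of points whose feature is close to the query.

    point_features: one semantic feature vector per 3D point (hypothetical).
    query_embedding: embedding of the text query in the same space (hypothetical).
    threshold: assumed similarity cutoff, chosen only for this toy example.
    """
    return [
        index
        for index, feature in enumerate(point_features)
        if cosine_similarity(feature, query_embedding) >= threshold
    ]


# Toy data: four points with hand-written 3-D "semantic" features.
point_features = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [1.0, 0.0, 0.0]

selected = select_points(point_features, query_embedding)
print(selected)
```

The selected indices could then drive an edit such as hiding or recoloring those points. The thresholded similarity is the simplest possible choice and is used here only to keep the sketch short.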
Abstract
Reconstructing and understanding 3D structures from a limited number of images is a well-established problem in computer vision. Traditional methods usually break this task into multiple subtasks, each requiring complex transformations between different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) involves converting images into key points, optimizing camera parameters, and estimating structures. Afterward, accurate sparse reconstructions are required for further dense modeling, which is subsequently fed into task-specific neural networks. This multi-step process results in considerable processing time and increased engineering complexity. In this work, we present the Large Spatial Model (LSM), which processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a…
Taxonomy
Topics: Computer Graphics and Visualization Techniques · 3D Surveying and Cultural Heritage · 3D Shape Modeling and Analysis
Methods: Sparse Evolutionary Training
