Large Spatial Model: End-to-end Unposed Images to Semantic 3D

Zhiwen Fan; Jian Zhang; Wenyan Cong; Peihao Wang; Renjie Li; Kairun Wen; Shijie Zhou; Achuta Kadambi; Zhangyang Wang; Danfei Xu; Boris Ivanovic; Marco Pavone; Yue Wang

arXiv:2410.18956 · cs.CV · November 1, 2024

TL;DR

The paper introduces the Large Spatial Model (LSM), a Transformer-based system that reconstructs and understands 3D scenes directly from unposed images, integrating geometry, appearance, and semantics in real time.

Contribution

LSM is the first end-to-end model to process unposed RGB images into semantic 3D representations using a unified Transformer architecture.

Findings

01 Achieves real-time semantic 3D reconstruction from unposed images.

02 Unifies multiple 3D vision tasks in a single feed-forward model.

03 Incorporates language-driven scene manipulation with a pre-trained segmentation model.

Abstract

Reconstructing and understanding 3D structures from a limited number of images is a well-established problem in computer vision. Traditional methods usually break this task into multiple subtasks, each requiring complex transformations between different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) involves converting images into key points, optimizing camera parameters, and estimating structures. Afterward, accurate sparse reconstructions are required for further dense modeling, which is subsequently fed into task-specific neural networks. This multi-step process results in considerable processing time and increased engineering complexity. In this work, we present the Large Spatial Model (LSM), which processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a…

Code & Models

Repositories

NVlabs/LSM · PyTorch · Official

Videos

Large Spatial Model: End-to-end Unposed Images to Semantic 3D · SlidesLive

Taxonomy

Topics: Computer Graphics and Visualization Techniques · 3D Surveying and Cultural Heritage · 3D Shape Modeling and Analysis

Methods: Sparse Evolutionary Training