SST: Real-time End-to-end Monocular 3D Reconstruction via Sparse Spatial-Temporal Guidance

Chenyangguang Zhang, Zhiqiang Lou, Yan Di, Federico Tombari, and Xiangyang Ji

arXiv:2212.06524 · cs.CV · July 26, 2023

Open Access

TL;DR

This paper introduces SST, a real-time end-to-end monocular 3D reconstruction method that leverages sparse spatial guidance and a novel temporal fusion mechanism to improve detail and accuracy in reconstructions.

Contribution

The paper presents an end-to-end network architecture with Local and Global Spatial-Temporal Fusion modules and a cross-modal attention mechanism, aimed at more detailed monocular 3D reconstruction at real-time speed.

Findings

01 Outperforms state-of-the-art methods on the ScanNet and 7-Scenes datasets.

02 Achieves real-time inference at 59 FPS.

03 Effectively captures tiny structures and geometric boundaries.

Abstract

Real-time monocular 3D reconstruction is a challenging problem that remains unsolved. Although recent end-to-end methods have demonstrated promising results, tiny structures and geometric boundaries are hardly captured, due to insufficient supervision that neglects spatial details and oversimplified feature fusion that ignores temporal cues. To address these problems, we propose an end-to-end 3D reconstruction network, SST, which utilizes Sparse estimated points from a visual SLAM system as additional Spatial guidance and fuses Temporal features via a novel cross-modal attention mechanism, achieving more detailed reconstruction results. We propose a Local Spatial-Temporal Fusion module to exploit more informative spatial-temporal cues from multi-view color information and sparse priors, as well as a Global Spatial-Temporal Fusion module to refine the local TSDF volumes with the world-frame model…
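The abstract does not spell out the form of the cross-modal attention, so the following is only a generic illustration of the idea, not the paper's module: a minimal scaled dot-product cross-attention in which features from one modality (here assumed to be image features) attend to features from another (here assumed to be features of sparse SLAM points). The function names `softmax` and `cross_attention`, the feature shapes, and the absence of learned projections are all assumptions made for this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: list of d-dim vectors from one modality
             (assumed here: per-voxel image features).
    keys, values: lists of vectors from the other modality
             (assumed here: sparse-point features); keys are d-dim.
    Returns one fused vector per query.
    """
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        fused.append(out)
    return fused
```

In this sketch the queries and keys are used directly; a learned variant would typically apply separate linear projections to each modality before computing the scores.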

Taxonomy

Topics: Advanced Vision and Imaging · 3D Surveying and Cultural Heritage · Optical measurement and interference techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings