SST: Real-time End-to-end Monocular 3D Reconstruction via Sparse Spatial-Temporal Guidance
Chenyangguang Zhang, Zhiqiang Lou, Yan Di, Federico Tombari, and Xiangyang Ji

TL;DR
This paper introduces SST, a real-time end-to-end monocular 3D reconstruction method that leverages sparse spatial guidance and a novel temporal fusion mechanism to improve detail and accuracy in reconstructions.
Contribution
The paper presents a new network architecture with spatial-temporal fusion modules and cross-modal attention, enhancing detail and speed in monocular 3D reconstruction.
Findings
Outperforms state-of-the-art methods on ScanNet and 7-Scenes datasets.
Achieves real-time inference at 59 FPS.
Effectively captures tiny structures and geometric boundaries.
Abstract
Real-time monocular 3D reconstruction is a challenging problem that remains unsolved. Although recent end-to-end methods have demonstrated promising results, tiny structures and geometric boundaries are hardly captured, because their insufficient supervision neglects spatial details and their oversimplified feature fusion ignores temporal cues. To address these problems, we propose an end-to-end 3D reconstruction network SST, which utilizes Sparse estimated points from a visual SLAM system as additional Spatial guidance and fuses Temporal features via a novel cross-modal attention mechanism, achieving more detailed reconstruction results. We propose a Local Spatial-Temporal Fusion module to exploit more informative spatial-temporal cues from multi-view color information and sparse priors, as well as a Global Spatial-Temporal Fusion module to refine the local TSDF volumes with the world-frame model…
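To make the idea of cross-modal attention concrete, the following is a minimal sketch of generic scaled dot-product attention in which image-derived features act as queries and sparse-point features act as keys and values. It is not the SST architecture: the function names, the toy 2-D features, the absence of learned projections, and the residual addition are all assumptions made for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys_values):
    """Generic scaled dot-product cross attention (illustrative only).

    queries:     list of N feature vectors (e.g. image features), each of length d
    keys_values: list of M feature vectors (e.g. sparse-point features), length d
    Returns (fused, weights): fused is N x d, weights is N x M.
    """
    d = len(queries[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this query to every sparse-point feature.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        w = softmax(scores)
        # Attention-weighted sum of the sparse-point features.
        attended = [
            sum(wj * kv[c] for wj, kv in zip(w, keys_values))
            for c in range(d)
        ]
        # Residual connection (an assumption, not taken from the paper).
        fused.append([qi + ai for qi, ai in zip(q, attended)])
        weights.append(w)
    return fused, weights


# Hypothetical toy features: 3 image tokens and 2 sparse-point tokens.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sparse_feats = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = cross_modal_attention(image_feats, sparse_feats)
```

Each row of `weights` sums to 1, so every image token receives a convex combination of the sparse-point features; a real network would typically apply learned linear projections to the queries, keys, and values first.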
Taxonomy
Topics: Advanced Vision and Imaging · 3D Surveying and Cultural Heritage · Optical measurement and interference techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
