TL;DR
LingBot-Map is a feed-forward 3D foundation model built on a geometric context transformer (GCT). It reconstructs camera poses and point clouds from a video stream at around 20 FPS and is reported to outperform existing streaming and optimization-based methods.
Contribution
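The paper introduces LingBot-Map, a feed-forward 3D foundation model for streaming reconstruction. Its attention mechanism combines an anchor context (coordinate grounding), a pose-reference window (dense geometric cues), and a trajectory memory (long-range drift correction).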
Findings
Achieves around 20 FPS at 518 x 378 resolution over long sequences.
Outperforms existing streaming and optimization-based approaches on various benchmarks.
Maintains rich geometric context with a compact streaming state.
Abstract
Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable efficient inference at around 20 FPS on 518 x 378 resolution…
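To make the idea of a compact streaming state concrete, the following is a minimal sketch of the bookkeeping the abstract implies: which past frames a new frame could attend to. It is not the authors' implementation. The class name `StreamingContext`, the sizes `num_anchor`, `window_size`, and `memory_size`, and the eviction and thinning rules are all assumptions made for illustration; the abstract names only the three components (anchor context, pose-reference window, trajectory memory) and their roles.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamingContext:
    """Hypothetical bookkeeping for a bounded streaming state.

    All sizes and update rules here are illustrative assumptions,
    not values or rules taken from the paper.
    """
    num_anchor: int = 1    # assumed: earliest frame(s) ground the coordinate system
    window_size: int = 4   # assumed: recent frames kept for dense geometric cues
    memory_size: int = 8   # assumed: cap on compact long-range trajectory entries
    anchor: list = field(default_factory=list)
    window: deque = field(default_factory=deque)
    memory: list = field(default_factory=list)

    def context_for_next(self):
        """Return the frame ids a new frame would attend to, by role."""
        return {
            "anchor": list(self.anchor),
            "window": list(self.window),
            "memory": list(self.memory),
        }

    def push(self, frame_id):
        """Register a processed frame and keep the state bounded."""
        if len(self.anchor) < self.num_anchor:
            self.anchor.append(frame_id)
            return
        self.window.append(frame_id)
        if len(self.window) > self.window_size:
            # Assumed rule: frames leaving the window become memory entries.
            self.memory.append(self.window.popleft())
            if len(self.memory) > self.memory_size:
                # Assumed rule: thin memory by keeping every other entry,
                # which preserves coverage of the whole trajectory.
                self.memory = self.memory[::2]


if __name__ == "__main__":
    ctx = StreamingContext()
    for frame_id in range(100):
        ctx.push(frame_id)
    print(ctx.context_for_next())
```

The point of the sketch is only that the total state stays bounded (here at most `num_anchor + window_size + memory_size` frame ids) no matter how long the stream runs, which is the property the abstract attributes to the design.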
