Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM
Gergely Dinya, Péter Halász, András Lőrincz, Kristóf Karacs, Anna Gelencsér-Horváth

TL;DR
This paper introduces a fast, memory-efficient 3D mapping framework built on VGGT for semantic SLAM. It targets near real-time scene understanding and change detection, with assistive navigation as a motivating application.
Contribution
The work presents a novel spatio-temporal scene understanding pipeline that overcomes VGGT's memory limitations and incorporates temporal consistency for dynamic environment mapping.
Findings
Achieves near real-time performance in 3D scene mapping.
Effectively detects environmental changes over time.
Demonstrates applicability on well-known benchmarks and on custom datasets designed for assistive navigation.
Abstract
We present a fast, spatio-temporal scene understanding framework based on the Visual Geometry Grounded Transformer (VGGT). The proposed pipeline is designed for efficient, close to real-time performance, supporting applications including assistive navigation. To update the 3D scene representation continuously, we process the image stream with a sliding window and align the resulting submaps, thereby overcoming VGGT's high memory demands. We exploit the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects. To allow for temporal consistency and richer contextual reasoning, the system stores timestamps and instance-level identities, thereby enabling the detection of changes in the environment. We evaluate the approach on well-known benchmarks and custom datasets specifically designed for assistive navigation scenarios. The results demonstrate the applicability of the…
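The sliding-window idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how submaps are aligned, so everything below is an assumption made for illustration: windows overlap by a fixed number of frames, and a submap is registered to the map with a least-squares similarity transform (the Umeyama method) estimated from 3D points the two share. The names `sliding_windows` and `umeyama` are hypothetical and do not come from the paper or from the VGGT codebase.

```python
import numpy as np


def sliding_windows(n_frames, size, overlap):
    """Yield (start, end) frame ranges; consecutive windows share `overlap` frames.

    Assumption: a fixed window size and overlap (not specified in the abstract).
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if n_frames <= 0:
        return
    step = size - overlap
    start = 0
    while True:
        end = min(start + size, n_frames)
        yield start, end
        if end == n_frames:
            break
        start += step


def umeyama(src, dst):
    """Least-squares similarity transform (s, R, t) such that dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points, e.g. points seen in
    the frames shared by two overlapping submaps (an assumption of this sketch).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)            # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                    # avoid returning a reflection
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)    # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

In a pipeline of this shape, each window would be passed to the model on its own, so peak memory depends on the window size rather than on the sequence length. The estimated `(s, R, t)` would then map the new submap into the frame of the existing map via `s * points @ R.T + t`. A scale term is included because independently reconstructed submaps need not share a metric scale; whether the paper's alignment estimates scale is not stated in the abstract.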
Taxonomy
Topics: Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications · Advanced Vision and Imaging
