SceneVGGT: VGGT-based online 3D semantic SLAM for indoor scene understanding and navigation
Anna Gelencsér-Horváth, Gergely Dinya, Dorka Boglárka Erős, Péter Halász, Islam Muhammad Muqsit, Kristóf Karacs

TL;DR
SceneVGGT is a novel 3D scene understanding framework that combines SLAM and semantic mapping, enabling efficient, long-term indoor navigation with robust object identification and real-time performance.
Contribution
It introduces a scalable, memory-efficient, SLAM-based semantic mapping approach built on VGGT that can process long video streams and support real-time assistive navigation.
Findings
Maintains GPU memory under 17 GB regardless of sequence length
Achieves competitive results on ScanNet++ benchmark
Supports interactive assistive navigation with audio feedback
Abstract
We present SceneVGGT, a spatio-temporal 3D scene understanding framework that combines SLAM with semantic mapping for autonomous and assistive navigation. Built on VGGT, our method scales to long video streams via a sliding-window pipeline. We align local submaps using camera-pose transformations, enabling memory- and speed-efficient mapping while preserving geometric consistency. Semantics are lifted from 2D instance masks to 3D objects using the VGGT tracking head, maintaining temporally coherent identities for change detection. As a proof of concept, object locations are projected onto an estimated floor plane for assistive navigation. The pipeline's GPU memory usage remains under 17 GB irrespective of the length of the input sequence, and the method achieves competitive point-cloud performance on the ScanNet++ benchmark. Overall, SceneVGGT ensures robust semantic identification and is fast…
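The submap alignment step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that consecutive windows share one overlap frame, that each window yields a 4x4 camera-to-submap pose for that frame, and that submaps have a consistent scale (scale handling is ignored here). All function names are hypothetical.

```python
# Minimal sketch (assumptions as stated above): place a new local submap in the
# world frame using the pose of a frame shared with the previous window.

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def submap_to_world(world_from_cam, local_from_cam):
    """Transform taking submap coordinates to world coordinates.

    world_from_cam: pose of the shared frame in the world (known from
                    earlier windows).
    local_from_cam: pose of the same frame inside the new submap.
    """
    return matmul(world_from_cam, invert_rigid(local_from_cam))

def transform_points(t, points):
    """Apply a 4x4 rigid transform to a list of 3D points."""
    return [tuple(sum(t[i][k] * p[k] for k in range(3)) + t[i][3]
                  for i in range(3))
            for p in points]

# Hypothetical example: the shared frame is rotated 90 degrees about z in the
# world and sits at x = 1 inside the new submap.
world_from_cam = [[0.0, -1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]]
local_from_cam = [[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]]
world_from_local = submap_to_world(world_from_cam, local_from_cam)
aligned = transform_points(world_from_local, [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
```

Because only the transform of each finished window needs to be kept, a scheme of this kind lets earlier frames be dropped from memory. This is consistent with, though not a confirmed explanation of, the bounded GPU memory usage reported above.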
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Multimodal Machine Learning Applications
