SceneVGGT: VGGT-based online 3D semantic SLAM for indoor scene understanding and navigation

Anna Gelencsér-Horváth; Gergely Dinya; Dorka Boglárka Erős; Péter Halász; Islam Muhammad Muqsit; Kristóf Karacs

arXiv:2602.15899·cs.RO·February 20, 2026

Open Access

TL;DR

SceneVGGT is a novel 3D scene understanding framework that combines SLAM and semantic mapping, enabling efficient, long-term indoor navigation with robust object identification and real-time performance.

Contribution

It introduces a scalable, memory-efficient SLAM-based semantic mapping approach built on VGGT that handles long video streams and supports real-time assistive navigation.

Findings

01 Maintains GPU memory under 17 GB regardless of sequence length

02 Achieves competitive results on ScanNet++ benchmark

03 Supports interactive assistive navigation with audio feedback

Abstract

We present SceneVGGT, a spatio-temporal 3D scene understanding framework that combines SLAM with semantic mapping for autonomous and assistive navigation. Built on VGGT, our method scales to long video streams via a sliding-window pipeline. We align local submaps using camera-pose transformations, enabling memory- and speed-efficient mapping while preserving geometric consistency. Semantics are lifted from 2D instance masks to 3D objects using the VGGT tracking head, maintaining temporally coherent identities for change detection. As a proof of concept, object locations are projected onto an estimated floor plane for assistive navigation. The pipeline's GPU memory usage remains under 17 GB irrespective of the length of the input sequence, and it achieves competitive point-cloud performance on the ScanNet++ benchmark. Overall, SceneVGGT ensures robust semantic identification and is fast…
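The abstract mentions aligning local submaps through camera-pose transformations. The sketch below is not the authors' implementation; it only illustrates one common way such an alignment can work, assuming that a single frame is shared between the global map and a new sliding-window submap, that its camera pose is known in both coordinate frames, and that both frames share the same scale. All function and variable names (`make_pose`, `submap_to_global`, `transform_points`) are hypothetical.

```python
import numpy as np


def make_pose(yaw_rad, translation):
    """Build a 4x4 rigid pose from a yaw angle (about z) and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def submap_to_global(T_global_cam, T_local_cam):
    """Rigid transform taking submap (local) coordinates to global coordinates.

    Both arguments are camera-to-frame poses of the SAME shared frame:
    T_global_cam expresses the camera in the global map, T_local_cam in the
    submap. Assumption: no scale difference between the two frames.
    """
    return T_global_cam @ np.linalg.inv(T_local_cam)


def transform_points(T, points):
    """Apply a 4x4 rigid transform to an (N, 3) array of points."""
    return points @ T[:3, :3].T + T[:3, 3]


# Synthetic check: choose a ground-truth local-to-global transform,
# derive the shared camera's global pose from it, then recover the transform.
T_global_local_true = make_pose(np.pi / 6, [1.0, -2.0, 0.5])
T_local_cam = make_pose(-np.pi / 4, [0.3, 0.1, 0.0])
T_global_cam = T_global_local_true @ T_local_cam

T_global_local = submap_to_global(T_global_cam, T_local_cam)

local_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 1.0]])
global_points = transform_points(T_global_local, local_points)
```

In a sliding-window setting, consecutive windows would typically overlap by at least one frame so that such a shared pose exists; how SceneVGGT selects overlap frames or handles scale is not stated in the abstract.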

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Multimodal Machine Learning Applications