TL;DR
LEXI-SG is a novel monocular RGB-based system that creates dense, open-vocabulary 3D scene graphs for indoor environments, enabling scalable and accurate scene understanding for robot navigation.
Contribution
It introduces the first dense monocular visual mapping system for open-vocabulary 3D scene graphs, leveraging the semantic priors of open-vocabulary foundation models and a room-based factor graph formulation.
Findings
Improves trajectory estimation and dense reconstruction over existing methods.
Achieves competitive open-vocabulary segmentation performance.
Demonstrates scalable dense mapping without sliding-window scale inconsistencies.
Abstract
Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction to when each room is fully observed -- enabling scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation…
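The hierarchy the abstract describes (scene, then rooms, then objects inside each room) can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the LEXI-SG implementation: the class names `SceneGraph`, `RoomNode`, and `ObjectNode`, the `query` method, and the three-dimensional toy embeddings are all hypothetical, standing in for features that an open-vocabulary model would produce.

```python
from dataclasses import dataclass, field
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


@dataclass
class ObjectNode:
    # Hypothetical leaf node: an object segment with a feature vector.
    name: str
    embedding: list  # toy stand-in for an open-vocabulary feature


@dataclass
class RoomNode:
    # Hypothetical mid-level node: a room that owns its objects.
    name: str
    objects: list = field(default_factory=list)


@dataclass
class SceneGraph:
    # Hypothetical root node: the scene as a list of rooms.
    rooms: list = field(default_factory=list)

    def query(self, text_embedding):
        """Return (room name, object name) of the best-matching object."""
        pairs = [(room, obj) for room in self.rooms for obj in room.objects]
        room, obj = max(pairs, key=lambda p: cosine(p[1].embedding, text_embedding))
        return room.name, obj.name


# Toy scene with made-up embeddings (assumption: 3-D vectors for readability).
graph = SceneGraph(rooms=[
    RoomNode("kitchen", [ObjectNode("kettle", [0.9, 0.1, 0.0]),
                         ObjectNode("fridge", [0.1, 0.9, 0.0])]),
    RoomNode("bedroom", [ObjectNode("lamp", [0.0, 0.1, 0.9])]),
])

# A query vector that is closest in direction to the lamp's embedding.
result = graph.query([0.0, 0.2, 0.8])
print(result)
```

Because every object is stored under its room, a match comes back with the room it belongs to, which is the kind of hierarchical answer a navigation planner would need.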
