Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces
Xinggang Hu, Chenyangguang Zhang, Alexandros Delitzas, Xiangkui Zhang, Marc Pollefeys, Francis Engelmann, Xiangyang Ji

TL;DR
This paper introduces a hierarchical open-vocabulary pipeline for constructing detailed 3D scene graphs in indoor spaces, enhancing coverage and robustness for scene understanding and robotic tasks.
Contribution
It extends existing benchmarks with dense tabletop objects and explicit multi-level functional relationships, and proposes an approach that combines 2D-3D visual grounding with temporal graph optimization.
Findings
Reliable inference of functional 3D scene graphs in challenging real-world scenes
Effective association of nodes across frames using multiple cues
Robust determination of functional connections through temporal graph optimization
Abstract
Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly simple design of previous pipelines, which primarily focus on large-scale furniture and lack hierarchical structure. In this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances: a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an…
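To make the representation described in the abstract concrete, the following is a minimal sketch of a functional scene graph with object nodes, interactive elements, functional relationship edges, and a parent link for the hierarchy. It is an illustration only, not the authors' implementation: the class names, the `kind` values, the `parent_id` field, and the kettle example are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    node_id: int
    label: str                       # open-vocabulary label (free text)
    kind: str                        # assumed values: "object" or "interactive_element"
    parent_id: Optional[int] = None  # hypothetical hierarchy link to a containing node


@dataclass
class FunctionalEdge:
    source_id: int   # the interactive element that is operated
    target_id: int   # the object whose state it affects
    relation: str    # free-text functional relationship


@dataclass
class FunctionalSceneGraph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: List[FunctionalEdge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        # A child may only be attached to a parent that already exists.
        if node.parent_id is not None and node.parent_id not in self.nodes:
            raise KeyError(f"unknown parent {node.parent_id}")
        self.nodes[node.node_id] = node

    def add_edge(self, source_id: int, target_id: int, relation: str) -> None:
        # Both endpoints must be known nodes.
        for nid in (source_id, target_id):
            if nid not in self.nodes:
                raise KeyError(f"unknown node {nid}")
        self.edges.append(FunctionalEdge(source_id, target_id, relation))

    def functional_targets(self, source_id: int) -> List[Tuple[str, str]]:
        # (relation, target label) pairs for every edge leaving source_id.
        return [
            (e.relation, self.nodes[e.target_id].label)
            for e in self.edges
            if e.source_id == source_id
        ]

    def hierarchy_path(self, node_id: int) -> List[str]:
        # Labels from the top-level ancestor down to the given node.
        path: List[str] = []
        current: Optional[int] = node_id
        while current is not None:
            node = self.nodes[current]
            path.append(node.label)
            current = node.parent_id
        return list(reversed(path))


# Hypothetical tabletop example: a kettle on a table, with a power button.
graph = FunctionalSceneGraph()
graph.add_node(Node(0, "table", "object"))
graph.add_node(Node(1, "kettle", "object", parent_id=0))
graph.add_node(Node(2, "power button", "interactive_element", parent_id=1))
graph.add_edge(2, 1, "switches on")
```

Here `graph.hierarchy_path(2)` walks the parent links to give the multi-level context of the button, and `graph.functional_targets(2)` lists what the button acts on. A real pipeline would also store geometry (for example 3D masks or bounding boxes) per node, which this sketch omits.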
