Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces

Xinggang Hu; Chenyangguang Zhang; Alexandros Delitzas; Xiangkui Zhang; Marc Pollefeys; Francis Engelmann; Xiangyang Ji

arXiv:2605.15753 · cs.RO · May 18, 2026

TL;DR

This paper introduces a hierarchical open-vocabulary pipeline for constructing detailed 3D scene graphs in indoor spaces, enhancing coverage and robustness for scene understanding and robotic tasks.

Contribution

The paper extends existing benchmarks with dense tabletop objects and explicit multi-level functional relationships, and proposes a pipeline that combines 2D-3D visual grounding with temporal graph optimization.

Findings

01 Reliable inference of functional 3D scene graphs in challenging real-world scenes

02 Effective association of nodes across frames using multiple cues

03 Robust determination of functional connections through temporal graph optimization

Abstract

Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture and lack hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances, including a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an…
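
As a rough illustration of the representation the abstract describes (object nodes, interactive elements, and functional relationship edges arranged in multiple levels), the following minimal Python sketch shows one way such a graph could be stored. It is not the authors' implementation: the class names (`Node`, `FunctionalEdge`, `FunctionalSceneGraph`), the `parent_id` field used to model hierarchy, and the example labels and relation strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """A graph node; all field names here are hypothetical."""
    node_id: int
    label: str                        # open-vocabulary text label, e.g. "lamp"
    kind: str                         # "object" or "interactive_element"
    parent_id: Optional[int] = None   # assumed way to encode hierarchy levels


@dataclass
class FunctionalEdge:
    """A directed functional relationship between two nodes."""
    source_id: int
    target_id: int
    relation: str                     # free-form text, e.g. "turns on"


@dataclass
class FunctionalSceneGraph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: List[FunctionalEdge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        # A child may only be attached to a parent that already exists.
        if node.parent_id is not None and node.parent_id not in self.nodes:
            raise KeyError(f"unknown parent {node.parent_id}")
        self.nodes[node.node_id] = node

    def add_edge(self, source_id: int, target_id: int, relation: str) -> None:
        # Both endpoints must already be present in the graph.
        for nid in (source_id, target_id):
            if nid not in self.nodes:
                raise KeyError(f"unknown node {nid}")
        self.edges.append(FunctionalEdge(source_id, target_id, relation))

    def targets_of(self, source_id: int) -> List[Tuple[str, str]]:
        """Return (relation, target label) pairs for edges leaving a node."""
        return [
            (e.relation, self.nodes[e.target_id].label)
            for e in self.edges
            if e.source_id == source_id
        ]

    def depth(self, node_id: int) -> int:
        """Number of parent links between a node and its root."""
        level = 0
        node = self.nodes[node_id]
        while node.parent_id is not None:
            node = self.nodes[node.parent_id]
            level += 1
        return level


def build_example_graph() -> FunctionalSceneGraph:
    """Toy tabletop scene: a table, a lamp on it, and the lamp's switch."""
    graph = FunctionalSceneGraph()
    graph.add_node(Node(0, "table", "object"))
    graph.add_node(Node(1, "lamp", "object", parent_id=0))
    graph.add_node(Node(2, "switch", "interactive_element", parent_id=1))
    graph.add_edge(2, 1, "turns on")
    return graph


if __name__ == "__main__":
    g = build_example_graph()
    print(g.targets_of(2))
    print(g.depth(2))
```

In this toy scene the switch sits two levels below the table, which is the kind of multi-level structure the extended benchmark is said to annotate. How the paper actually encodes hierarchy and relations may differ.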
