Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph

Sergey Linok; Gleb Naumov

arXiv:2507.12123·cs.CV·July 17, 2025

TL;DR

This paper introduces OVIGo-3DHSG, a method that combines 3D hierarchical scene graphs and large language models to improve open-vocabulary object grounding and spatial reasoning in complex indoor environments.

Contribution

The paper presents a hierarchical scene graph representation built from RGB-D data and integrated with language models for indoor object grounding and spatial reasoning.

Findings

01. Effective scene understanding in multi-floor environments

02. Improved object grounding accuracy over existing methods

03. Robust handling of complex spatial queries

Abstract

We propose OVIGo-3DHSG, a method for Open-Vocabulary Indoor Grounding of objects using a 3D Hierarchical Scene Graph. OVIGo-3DHSG represents an extensive indoor environment as a hierarchical scene graph derived from sequences of RGB-D frames, using a set of open-vocabulary foundation models and sensor data processing. The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects. To address complex queries that refer to an object through its spatial relation to other objects, we integrate the hierarchical scene graph with a Large Language Model for multistep reasoning. This integration leverages inter-layer (e.g., room-to-object) and intra-layer (e.g., object-to-object) connections, enhancing spatial contextual understanding. We investigate the semantic and geometric accuracy of the hierarchical representation on Habitat Matterport 3D Semantic multi-floor…
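The layered structure described in the abstract can be illustrated with a small data-structure sketch. This is not the authors' implementation: the class names, method names, node labels, and the "next to" relation are all hypothetical, and the sketch assumes only what the abstract states, namely four layers (floor, room, location, object), inter-layer edges linking a node to a coarser-layer parent, and intra-layer edges linking nodes of the same layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Layer order, coarse to fine, as listed in the abstract.
LAYERS = ("floor", "room", "location", "object")


@dataclass
class Node:
    node_id: str
    layer: str
    label: str
    parent: Optional[str] = None  # inter-layer edge to a coarser node
    children: List[str] = field(default_factory=list)


class HierarchicalSceneGraph:
    """Hypothetical container; names are illustrative assumptions."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        # Intra-layer edges stored as (source, relation, target) triples.
        self.relations: List[Tuple[str, str, str]] = []

    def add_node(self, node_id: str, layer: str, label: str,
                 parent: Optional[str] = None) -> None:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        if node_id in self.nodes:
            raise ValueError(f"duplicate node id: {node_id}")
        if parent is not None:
            parent_node = self.nodes[parent]
            if LAYERS.index(parent_node.layer) >= LAYERS.index(layer):
                raise ValueError("parent must belong to a coarser layer")
            parent_node.children.append(node_id)
        self.nodes[node_id] = Node(node_id, layer, label, parent)

    def add_relation(self, src: str, relation: str, dst: str) -> None:
        if self.nodes[src].layer != self.nodes[dst].layer:
            raise ValueError("intra-layer relations need same-layer nodes")
        self.relations.append((src, relation, dst))

    def ancestors(self, node_id: str) -> List[str]:
        """Follow inter-layer edges upward, nearest ancestor first."""
        path = []
        current = self.nodes[node_id].parent
        while current is not None:
            path.append(current)
            current = self.nodes[current].parent
        return path

    def descendants(self, node_id: str, layer: str) -> List[str]:
        """Collect all nodes of the given layer below node_id, sorted by id."""
        found = []
        stack = list(self.nodes[node_id].children)
        while stack:
            node = self.nodes[stack.pop()]
            if node.layer == layer:
                found.append(node.node_id)
            stack.extend(node.children)
        return sorted(found)


def build_example() -> HierarchicalSceneGraph:
    # Invented toy scene, used only to exercise the structure.
    graph = HierarchicalSceneGraph()
    graph.add_node("f1", "floor", "first floor")
    graph.add_node("r1", "room", "kitchen", parent="f1")
    graph.add_node("l1", "location", "counter area", parent="r1")
    graph.add_node("o1", "object", "mug", parent="l1")
    graph.add_node("o2", "object", "kettle", parent="l1")
    graph.add_relation("o1", "next to", "o2")
    return graph
```

In this sketch, a query such as "the mug next to the kettle in the kitchen" could be narrowed by first selecting the room node, then listing its object descendants, then filtering by the stored intra-layer triples. How the paper's Large Language Model actually consumes the graph is not specified in the abstract, so that step is left out.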

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Human Pose and Action Recognition