Open-Vocabulary Octree-Graph for 3D Scene Understanding

Zhigang Wang; Yifei Su; Chenhui Li; Dong Wang; Yan Huang; Bin Zhao; Xuelong Li

arXiv:2411.16253·cs.CV·March 18, 2026

TL;DR

This paper introduces Octree-Graph, a new 3D scene representation that efficiently encodes spatial and semantic information for open-vocabulary understanding, improving downstream tasks like path planning and object retrieval.

Contribution

The paper proposes an adaptive-octree-based scene representation, together with a Chronological Group-wise Segment Merging (CGSM) strategy for obtaining 3D instances and an Instance Feature Aggregation (IFA) algorithm for obtaining their semantic features.

Findings

01 Demonstrates improved performance on various 3D understanding tasks

02 Efficient storage and representation of 3D scenes with occupancy and semantics

03 Versatile application across multiple datasets

Abstract

Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project the resulting segments onto point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and does not directly convey occupancy information or spatial relations, making existing methods inefficient for downstream tasks, e.g., path planning and text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to get 3D instances and corresponding semantic features. Subsequently, an adaptive-octree structure is developed that…
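The abstract contrasts an unordered point cloud, which does not directly convey occupancy, with an octree-based structure. The sketch below illustrates that general idea with a plain textbook octree that subdivides only the cells containing points. It is not the paper's adaptive-octree, whose details are cut off above; the names `OctreeNode`, `build`, `is_occupied`, and `count_nodes`, the unit-cube bounds, and the depth limit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class OctreeNode:
    center: Point
    half: float                      # half the edge length of this cubic cell
    occupied: bool = False
    children: Optional[List["OctreeNode"]] = None


def _child_index(center: Point, p: Point) -> int:
    # One bit per axis: bit 0 for x, bit 1 for y, bit 2 for z.
    return (int(p[0] >= center[0])
            | (int(p[1] >= center[1]) << 1)
            | (int(p[2] >= center[2]) << 2))


def build(points: List[Point], center: Point, half: float, max_depth: int) -> OctreeNode:
    node = OctreeNode(center, half)
    if not points:
        return node                  # empty cell stays a single free leaf
    node.occupied = True
    if max_depth == 0:
        return node                  # finest resolution reached
    buckets: List[List[Point]] = [[] for _ in range(8)]
    for p in points:
        buckets[_child_index(center, p)].append(p)
    q = half / 2
    node.children = []
    for i in range(8):
        child_center = (center[0] + (q if i & 1 else -q),
                        center[1] + (q if i & 2 else -q),
                        center[2] + (q if i & 4 else -q))
        node.children.append(build(buckets[i], child_center, q, max_depth - 1))
    return node


def is_occupied(node: OctreeNode, p: Point) -> bool:
    # Descend to the leaf cell containing p and read its occupancy flag.
    while node.children is not None:
        node = node.children[_child_index(node.center, p)]
    return node.occupied


def count_nodes(node: OctreeNode) -> int:
    if node.children is None:
        return 1
    return 1 + sum(count_nodes(c) for c in node.children)


# Hypothetical toy scene: three points inside the unit cube.
cloud = [(0.1, 0.1, 0.1), (0.12, 0.11, 0.1), (0.9, 0.9, 0.9)]
root = build(cloud, center=(0.5, 0.5, 0.5), half=0.5, max_depth=3)
print(is_occupied(root, (0.1, 0.1, 0.1)), is_occupied(root, (0.9, 0.1, 0.1)))
print(count_nodes(root))
```

Because empty regions collapse into single leaves, a free-space query such as the second one returns after one step, and the tree holds far fewer nodes than a uniform grid at the same resolution. That is the kind of occupancy lookup a path planner would need and that a raw point list does not provide.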

Taxonomy

Topics: Image Processing and 3D Reconstruction · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques

Methods: Sparse Evolutionary Training