Open-Vocabulary Octree-Graph for 3D Scene Understanding
Zhigang Wang, Yifei Su, Chenhui Li, Dong Wang, Yan Huang, Bin Zhao, Xuelong Li

TL;DR
This paper introduces Octree-Graph, a new 3D scene representation that efficiently encodes spatial and semantic information for open-vocabulary understanding, improving downstream tasks like path planning and object retrieval.
Contribution
The paper proposes a novel adaptive-octree-based scene representation, together with algorithms for 3D instance segmentation and semantic feature aggregation, to improve 3D scene understanding.
Findings
Demonstrates improved performance on various 3D understanding tasks
Efficient storage and representation of 3D scenes with occupancy and semantics
Versatile application across multiple datasets
Abstract
Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project them to point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and does not directly convey occupancy information or spatial relation, making existing methods inefficient for downstream tasks, e.g., path planning and text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to get 3D instances and corresponding semantic features. Subsequently, an adaptive-octree structure is developed that…
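To make the abstract's point about occupancy concrete, here is a minimal sketch of a generic octree built over a point set. It is not the paper's adaptive-octree or its Octree-Graph; it is a textbook structure shown only to illustrate why a tree of cubes can answer "is this location occupied?" directly, whereas an unordered point list cannot. All names (OctreeNode, build, is_occupied, count_nodes, min_half) are hypothetical and chosen for this example. Subdivision is content-dependent in a simple sense: empty cubes are never split, and occupied cubes are split until they reach an assumed minimum size.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class OctreeNode:
    center: Point                      # center of this cube
    half: float                        # half of the cube's edge length
    occupied: bool = False             # True if any input point falls inside
    children: Optional[List["OctreeNode"]] = None  # None for a leaf


def child_index(center: Point, p: Point) -> int:
    """Pick one of 8 octants: bit 0 for x, bit 1 for y, bit 2 for z."""
    return (int(p[0] >= center[0])
            | (int(p[1] >= center[1]) << 1)
            | (int(p[2] >= center[2]) << 2))


def build(points: List[Point], center: Point, half: float,
          min_half: float) -> OctreeNode:
    """Recursively split occupied cubes; empty cubes stay as single leaves."""
    node = OctreeNode(center, half, occupied=bool(points))
    if not points or half <= min_half:
        return node
    buckets: List[List[Point]] = [[] for _ in range(8)]
    for p in points:
        buckets[child_index(center, p)].append(p)
    q = half / 2
    node.children = []
    for i in range(8):
        c = (center[0] + (q if i & 1 else -q),
             center[1] + (q if i & 2 else -q),
             center[2] + (q if i & 4 else -q))
        node.children.append(build(buckets[i], c, q, min_half))
    return node


def is_occupied(node: OctreeNode, p: Point) -> bool:
    """Descend to the leaf cube containing p and report its occupancy flag."""
    while node.children is not None:
        node = node.children[child_index(node.center, p)]
    return node.occupied


def count_nodes(node: OctreeNode) -> int:
    return 1 + sum(count_nodes(c) for c in (node.children or []))


# Two points in opposite corners of a unit cube (made-up data).
pts = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]
root = build(pts, center=(0.5, 0.5, 0.5), half=0.5, min_half=0.125)
print(is_occupied(root, (0.1, 0.1, 0.1)))  # query inside an occupied leaf
print(is_occupied(root, (0.5, 0.1, 0.1)))  # query inside an empty leaf
print(count_nodes(root))
```

In this sketch a query is a walk from the root to one leaf, so its cost depends on tree depth rather than on the number of points, and large empty regions cost a single node each. A planner could treat unoccupied leaves as free space; how the paper attaches semantic features and graph edges to its octrees is not covered by the truncated abstract and is not modelled here.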
Taxonomy
Topics: Image Processing and 3D Reconstruction · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques
Methods: Sparse Evolutionary Training
