Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding

Hanchen Tai; Qingdong He; Jiangning Zhang; Yijie Qian; Zhenyu Zhang; Xiaobin Hu; Xiangtai Li; Yabiao Wang; Yong Liu

arXiv:2405.15580·cs.CV·September 6, 2024

TL;DR

This paper introduces OV-SAM3D, a training-free framework leveraging SAM and RAM for open-vocabulary 3D scene understanding, enabling zero-shot recognition without prior scene knowledge.

Contribution

The paper presents OV-SAM3D, a novel training-free method that combines superpoint prompts, SAM segmentation, and RAM's open tags for open-vocabulary 3D scene understanding.
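The summary above does not say how superpoints become prompts for SAM, so the following is only a minimal sketch under stated assumptions, not the authors' procedure. It assumes each superpoint is a list of 3D points already expressed in camera coordinates, and it assumes a standard pinhole camera model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`. Each superpoint's centroid is projected to a pixel that could then be handed to a 2D segmenter as a point prompt. All function names here are illustrative.

```python
from typing import Dict, List, Optional, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[int, int]


def centroid(points: List[Point3D]) -> Point3D:
    """Mean of a non-empty list of 3D points."""
    n = len(points)
    if n == 0:
        raise ValueError("superpoint has no points")
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def project_to_pixel(
    point_cam: Point3D,
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> Optional[Pixel]:
    """Pinhole projection of a camera-frame point; None if not visible."""
    x, y, z = point_cam
    if z <= 0:  # behind the camera or on the image plane
        return None
    u = int(round(fx * x / z + cx))
    v = int(round(fy * y / z + cy))
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None  # projects outside the image bounds


def superpoint_prompts(
    superpoints: Dict[int, List[Point3D]],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> Dict[int, Pixel]:
    """Map each visible superpoint id to one candidate 2D point prompt."""
    prompts: Dict[int, Pixel] = {}
    for sp_id, pts in superpoints.items():
        pixel = project_to_pixel(centroid(pts), fx, fy, cx, cy, width, height)
        if pixel is not None:
            prompts[sp_id] = pixel
    return prompts
```

In this sketch a superpoint whose centroid falls behind the camera or outside the frame simply yields no prompt for that view. A real pipeline would also need occlusion handling and a world-to-camera transform, both omitted here.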

Findings

1. Outperforms existing open-vocabulary methods on ScanNet200 and nuScenes datasets.

2. Enables zero-shot 3D scene understanding without prior training.

3. Demonstrates robustness in open-world environments.

Abstract

Open-vocabulary 3D scene understanding presents a significant challenge in the field. Recent works have sought to transfer knowledge embedded in vision-language models from 2D to 3D domains. However, these approaches often require prior knowledge from specific 3D scene datasets, limiting their applicability in open-world scenarios. The Segment Anything Model (SAM) has demonstrated remarkable zero-shot segmentation capabilities, prompting us to investigate its potential for comprehending 3D scenes without training. In this paper, we introduce OV-SAM3D, a training-free method that contains a universal framework for understanding open-vocabulary 3D scenes. This framework is designed to perform understanding tasks for any 3D scene without requiring prior knowledge of the scene. Specifically, our method is composed of two key sub-modules: First, we initiate the process by generating…

Taxonomy

Topics: Natural Language Processing Techniques · Image Processing and 3D Reconstruction

Methods: Segment Anything Model