Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao; Na Zhao; Jingjing Chen; Yu-Gang Jiang

arXiv:2407.05256 · cs.CV · July 18, 2024

Open Access

TL;DR

This paper enhances open-vocabulary 3D object detection by integrating vision-language foundation models, enabling zero-shot discovery and hierarchical alignment to improve recognition of unseen objects in 3D scenes.

Contribution

The paper introduces a hierarchical alignment method and uses a vision foundation model for zero-shot discovery of novel objects, making fuller use of foundation models in OV-3DDet.

Findings

01 Significant accuracy improvements in open-vocabulary detection

02 Effective zero-shot discovery of novel objects in 3D scenes

03 Enhanced generalization to unseen categories

Abstract

Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. While language and vision foundation models have achieved success in handling various open-vocabulary tasks with abundant training data, OV-3DDet faces a significant challenge due to the limited availability of training data. Although some pioneering efforts have integrated vision-language model (VLM) knowledge into OV-3DDet learning, the potential of these foundation models has yet to be fully exploited. In this paper, we unlock the textual and visual wisdom to tackle the open-vocabulary 3D detection task by leveraging the language and vision foundation models. We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes. Specifically, we utilize an object detection vision…

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques

Methods: ALIGN