PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

Xiangyang Zhu; Renrui Zhang; Bowei He; Ziyu Guo; Ziyao Zeng; Zipeng Qin; Shanghang Zhang; Peng Gao

arXiv:2211.11682 · cs.CV · August 29, 2023 · 22 cites

TL;DR

PointCLIP V2 prompts CLIP and GPT to perform 3D open-world tasks such as classification, segmentation, and detection in a zero-shot setting, without any training in 3D domains, and also extends to few-shot learning.

Contribution

The paper introduces PointCLIP V2, a framework that unifies CLIP and GPT for zero-shot 3D understanding through two designs: a shape projection module that generates more realistic depth maps for CLIP, and GPT-generated 3D-specific text fed to CLIP's textual encoder.

Findings

1. Achieves over 40% accuracy improvements in zero-shot 3D classification.

2. Effectively extends to few-shot classification, segmentation, and detection tasks.

3. Demonstrates strong generalization in 3D open-world learning scenarios.

Abstract

Large-scale pre-trained models have shown promising open-world performance for both vision and language tasks. However, their transferred capacity on 3D point clouds is still limited and constrained to the classification task. In this paper, we are the first to combine CLIP and GPT into a unified 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection. To better align 3D data with the pre-trained language knowledge, PointCLIP V2 contains two key designs. On the visual end, we prompt CLIP via a shape projection module to generate more realistic depth maps, narrowing the domain gap between projected point clouds and natural images. On the textual end, we prompt the GPT model to generate 3D-specific text as the input of CLIP's textual encoder. Without any training in 3D domains, our approach…
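To make the two ideas in the abstract concrete, here is a minimal sketch in Python with NumPy. It is not the authors' shape projection module or their prompting pipeline: `project_to_depth_map` is a hypothetical helper that performs a plain orthographic projection of a point cloud onto a depth image, and `zero_shot_classify` is a hypothetical helper that scores classes by cosine similarity between per-view image features and per-class text features. The feature vectors are assumed to come from some image and text encoder (for example CLIP); no encoder is called here, and the grid resolution and depth scaling are illustrative choices.

```python
import numpy as np


def project_to_depth_map(points, resolution=32):
    """Orthographically project an (N, 3) point cloud along the z axis.

    Returns a (resolution, resolution) array. Each occupied pixel stores a
    "closeness" value in [0.1, 1.0] (larger means smaller z); empty pixels
    stay at 0. This is an illustrative projection, not the paper's module.
    """
    points = np.asarray(points, dtype=np.float64)

    # Normalize every axis to [0, 1] so the cloud fills the image grid.
    mins = points.min(axis=0)
    spans = points.max(axis=0) - mins
    spans[spans == 0] = 1.0  # guard against flat clouds
    unit = (points - mins) / spans

    # Map x to columns and y to rows, clamping the upper edge into the grid.
    cols = np.minimum((unit[:, 0] * resolution).astype(int), resolution - 1)
    rows = np.minimum((unit[:, 1] * resolution).astype(int), resolution - 1)

    # Points with the smallest z are treated as nearest to the viewer.
    closeness = 1.0 - 0.9 * unit[:, 2]

    depth = np.zeros((resolution, resolution))
    # When several points fall into one pixel, keep the nearest one.
    np.maximum.at(depth, (rows, cols), closeness)
    return depth


def zero_shot_classify(view_features, text_features):
    """Pick the class whose text feature best matches the view features.

    view_features: (V, D) array, one feature per projected view.
    text_features: (K, D) array, one feature per class prompt.
    Returns (predicted_class_index, per_class_scores).
    """
    v = np.asarray(view_features, dtype=np.float64)
    t = np.asarray(text_features, dtype=np.float64)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)

    similarity = v @ t.T               # (V, K) cosine similarities
    scores = similarity.sum(axis=0)    # aggregate over views
    return int(np.argmax(scores)), scores
```

In a full pipeline under these assumptions, each depth map would be passed through an image encoder to obtain `view_features`, and each class description (hand-written or generated by a language model) would be passed through a text encoder to obtain `text_features`. Summing similarities over views is one simple aggregation choice among several.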

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Weight Decay · Linear Layer · Attention Dropout · Softmax · Dense Connections · Residual Connection · Discriminative Fine-Tuning