PointCLIP: Point Cloud Understanding by CLIP

Renrui Zhang; Ziyu Guo; Wei Zhang; Kunchang Li; Xupeng Miao; Bin Cui; Yu Qiao; Peng Gao; Hongsheng Li

arXiv:2112.02413·cs.CV·December 7, 2021

TL;DR

PointCLIP leverages CLIP's image-text alignment for 3D point cloud recognition by multi-view projection and adaptive fusion, enabling effective zero-shot and few-shot 3D understanding with minimal training.

Contribution

This work introduces PointCLIP, a novel method that adapts CLIP for 3D point cloud recognition through multi-view encoding and an inter-view adapter, achieving strong zero-shot and few-shot performance.
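One way to picture an inter-view adapter is as a small bottleneck network that pools all view features into a single global feature and then sends a correction back to each view through a residual connection. The sketch below is only an illustration under that assumption: the class name `InterViewAdapter`, the layer sizes, the random weights, and the `ratio` blending factor are hypothetical and are not taken from the paper or its repositories. In a few-shot setting these weights would be trained while CLIP stays frozen; training is omitted here.

```python
import random


def linear(vec, weight):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in vec]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real adapter would learn these from few-shot data."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class InterViewAdapter:
    """Hypothetical bottleneck adapter over `num_views` features of size `dim`."""

    def __init__(self, num_views, dim, hidden, ratio=0.5, seed=0):
        rng = random.Random(seed)
        self.ratio = ratio  # assumed blending factor for the residual update
        # Maps the concatenated view features to one global feature.
        self.to_global = random_matrix(hidden, num_views * dim, rng)
        # One projection per view, mapping the global feature back to `dim`.
        self.to_views = [random_matrix(dim, hidden, rng) for _ in range(num_views)]

    def __call__(self, view_features):
        # Concatenate all views so the global feature sees every view at once.
        concat = [v for feat in view_features for v in feat]
        global_feat = relu(linear(concat, self.to_global))
        adapted = []
        for feat, weight in zip(view_features, self.to_views):
            update = relu(linear(global_feat, weight))
            # Residual connection: keep the original feature, add the update.
            adapted.append([f + self.ratio * u for f, u in zip(feat, update)])
        return adapted
```

The residual form means that with `ratio=0.0` the adapter returns the input features unchanged, so the zero-shot behaviour is the starting point and the adapter only adds to it.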

Findings

01

PointCLIP outperforms classical 3D-supervised networks in few-shot experiments.

02

Ensembling PointCLIP with traditional models boosts overall accuracy.

03

PointCLIP surpasses state-of-the-art models on ModelNet and ScanObjectNN datasets.
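The ensembling in finding 02 can be illustrated with a toy example. The sketch below assumes the two models are combined by adding their per-class logits, which is one simple way to ensemble; the function names and the numbers are made up for illustration and are not results from the paper.

```python
def ensemble_logits(pointclip_logits, baseline_logits):
    """Add per-class logits from two models (assumed ensembling rule)."""
    if len(pointclip_logits) != len(baseline_logits):
        raise ValueError("both models must score the same set of classes")
    return [a + b for a, b in zip(pointclip_logits, baseline_logits)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits over three classes: each model alone prefers a different class.
pointclip = [2.0, 0.5, 1.0]   # prefers class 0
baseline = [0.2, 1.9, 1.6]    # prefers class 1
combined = ensemble_logits(pointclip, baseline)
prediction = argmax(combined)  # class 2, which both models rank reasonably high
```

In this toy case the two models disagree, and the sum favours the class that both find plausible. This is the kind of complementary behaviour that makes ensembling useful.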

Abstract

Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP), which learns to match images with their corresponding texts in open-vocabulary settings, has shown inspirational performance on 2D visual recognition. However, it remains underexplored whether CLIP, pre-trained on large-scale 2D image-text pairs, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which aligns CLIP-encoded point clouds with 3D category texts. Specifically, we encode a point cloud by projecting it into multi-view depth maps without rendering, and aggregate the view-wise zero-shot predictions to achieve knowledge transfer from 2D to 3D. On top of that, we design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot knowledge learned from 3D into CLIP…
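The two steps named in the abstract, projecting a point cloud into multi-view depth maps and aggregating view-wise zero-shot predictions, can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes an orthographic projection of points in [-1, 1]^3, views obtained by rotating around the vertical axis, and a weighted sum of cosine similarities as the aggregation rule. The CLIP encoders are left out entirely, so `view_features` and `text_features` are placeholder vectors, and all function names are hypothetical.

```python
import math


def rotate_y(point, angle):
    """Rotate a 3D point around the vertical (y) axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def depth_map(points, angle, size=8, offset=2.0):
    """Project points onto a size x size grid as seen from one view.

    Assumption: orthographic projection of points in [-1, 1]^3. Each pixel
    keeps the nearest (smallest) depth; 0.0 marks an empty pixel.
    """
    grid = [[0.0] * size for _ in range(size)]
    for p in points:
        x, y, z = rotate_y(p, angle)
        col = min(size - 1, max(0, int((x + 1.0) / 2.0 * size)))
        row = min(size - 1, max(0, int((1.0 - y) / 2.0 * size)))
        depth = z + offset  # shift so that every depth is positive
        if grid[row][col] == 0.0 or depth < grid[row][col]:
            grid[row][col] = depth
    return grid


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_predict(view_features, text_features, view_weights):
    """Weighted sum of per-view similarities to each category text feature."""
    logits = [0.0] * len(text_features)
    for feat, weight in zip(view_features, view_weights):
        for k, text in enumerate(text_features):
            logits[k] += weight * cosine(feat, text)
    return softmax(logits)
```

In a full pipeline each depth map would be passed through CLIP's image encoder to obtain one entry of `view_features`, and each category name would be passed through the text encoder to obtain one entry of `text_features`. No 3D rendering is involved: the depth map is filled directly from the point coordinates.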

Taxonomy

Methods: Contrastive Language-Image Pre-training · Adapter