EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder

Xiaoshui Huang; Zhou Huang; Sheng Li; Wentao Qu; Tong He; Yuenan Hou; Yifan Zuo; Wanli Ouyang

arXiv:2212.04098 · cs.CV · December 12, 2023

TL;DR

EPCL introduces a method to leverage frozen CLIP transformers for efficient 3D point cloud encoding, aligning 2D and 3D features without paired data, and demonstrates strong performance across multiple 3D tasks.

Contribution

The paper presents a novel point cloud tokenizer and a way to use frozen CLIP transformers for 3D tasks, reducing pretraining complexity and enabling cross-modal alignment.

Findings

01 Improves AP50 by 19.7 on ScanNet V2 detection

02 Improves mIoU by 4.4 on S3DIS segmentation

03 Improves mIoU by 1.2 on SemanticKITTI segmentation

Abstract

The pretrain-finetune paradigm has achieved great success in NLP and 2D image fields because of the high-quality representation ability and transferability of their pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field due to the limited amount of point cloud sequences. This paper introduces Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP transformer. Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a series of local patches, which are converted to token embeddings by the designed point cloud tokenizer. These token embeddings are concatenated with a task…
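The patch-and-tokenize step described in the abstract can be illustrated with a small, standard-library-only sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: farthest point sampling for patch centers, k-nearest-neighbour grouping, a single random linear map followed by max-pooling as the "tokenizer", a zero vector standing in for the task token, and the choice to place that token first in the sequence. The frozen CLIP transformer itself is not modeled; the sketch stops at the token sequence such an encoder would receive.

```python
import math
import random

def farthest_point_sampling(points, num_centers):
    """Return indices of num_centers points that are spread out over the cloud."""
    chosen = [0]  # start from the first point (an arbitrary choice)
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < num_centers:
        # Pick the point farthest from every center chosen so far.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

def group_patches(points, center_ids, k):
    """Gather the k nearest points of each center, as offsets from that center."""
    patches = []
    for c in center_ids:
        center = points[c]
        nearest = sorted(points, key=lambda p: math.dist(p, center))[:k]
        patches.append([tuple(a - b for a, b in zip(p, center)) for p in nearest])
    return patches

def tokenize(patch, weights):
    """Toy tokenizer (assumption): per-point linear map, then max-pool over the patch."""
    feats = [[sum(w * x for w, x in zip(row, p)) for row in weights] for p in patch]
    return [max(col) for col in zip(*feats)]

random.seed(0)
DIM = 8  # hypothetical embedding width, far smaller than a real model's
cloud = [(random.random(), random.random(), random.random()) for _ in range(256)]
weights = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(DIM)]
task_token = [0.0] * DIM  # placeholder; would be learnable in a real model

centers = farthest_point_sampling(cloud, num_centers=16)
patches = group_patches(cloud, centers, k=32)
patch_tokens = [tokenize(p, weights) for p in patches]

# Sequence a frozen transformer encoder would consume: task token + patch tokens.
sequence = [task_token] + patch_tokens
print(len(sequence), len(sequence[0]))  # prints: 17 8
```

In this sketch only `weights` and `task_token` would be trainable; the transformer that consumes `sequence` would keep its pretrained parameters fixed, which is the sense in which the TL;DR calls the CLIP transformer frozen.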

Taxonomy

Topics: 3D Surveying and Cultural Heritage · 3D Shape Modeling and Analysis · Remote Sensing and LiDAR Applications

Methods: Contrastive Language-Image Pre-training