EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder
Xiaoshui Huang, Zhou Huang, Sheng Li, Wentao Qu, Tong He, Yuenan Hou, Yifan Zuo, Wanli Ouyang

TL;DR
EPCL introduces a method to leverage frozen CLIP transformers for efficient 3D point cloud encoding, aligning 2D and 3D features without paired data, and demonstrates strong performance across multiple 3D tasks.
Contribution
The paper presents a novel point cloud tokenizer and a way to use frozen CLIP transformers for 3D tasks, reducing pretraining complexity and enabling cross-modal alignment.
Findings
Improves ScanNet V2 detection by 19.7 AP50
Improves S3DIS segmentation by 4.4 mIoU
Improves SemanticKITTI segmentation by 1.2 mIoU
Abstract
The pretrain-finetune paradigm has achieved great success in NLP and 2D image fields because of the high-quality representation ability and transferability of their pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field due to the limited amount of point cloud sequences. This paper introduces Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP transformer. Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a series of local patches, which are converted to token embeddings by the designed point cloud tokenizer. These token embeddings are concatenated with a task…
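The patch-division step mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how patches are formed, so farthest point sampling followed by k-nearest-neighbour grouping is assumed here because it is a common choice in point cloud transformers. The function names and parameters (`farthest_point_sampling`, `group_patches`, `num_patches`, `patch_size`) are hypothetical, and the learned tokenizer that would turn each patch into an embedding is not shown.

```python
import math
import random


def farthest_point_sampling(points, num_centers, start=0):
    """Return indices of num_centers points spread out over the cloud.

    Assumed patch-center selection; the source does not confirm this choice.
    """
    if num_centers > len(points):
        raise ValueError("num_centers must not exceed the number of points")
    chosen = [start]
    # min_dist[i] = distance from point i to its nearest chosen center so far
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_centers:
        # Next center is the point farthest from all centers chosen so far
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def group_patches(points, num_patches, patch_size):
    """Divide a cloud into num_patches local patches of patch_size points.

    Each patch holds the patch_size nearest neighbours of its center,
    expressed as offsets from that center.
    """
    centers = farthest_point_sampling(points, num_patches)
    patches = []
    for c in centers:
        cx, cy, cz = points[c]
        # Indices of the nearest points to the center (center itself included)
        nearest = sorted(
            range(len(points)), key=lambda i: math.dist(points[i], points[c])
        )[:patch_size]
        patches.append(
            [(points[i][0] - cx, points[i][1] - cy, points[i][2] - cz)
             for i in nearest]
        )
    return centers, patches


if __name__ == "__main__":
    random.seed(0)
    cloud = [(random.random(), random.random(), random.random())
             for _ in range(256)]
    centers, patches = group_patches(cloud, num_patches=16, patch_size=32)
    print(len(patches), "patches of", len(patches[0]), "points each")
```

In a pipeline like the one the abstract outlines, each patch would then pass through the point cloud tokenizer to yield one token embedding. This sketch stops at the grouping step and runs in quadratic time, so it is only suitable for small clouds.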
Taxonomy
Topics: 3D Surveying and Cultural Heritage · 3D Shape Modeling and Analysis · Remote Sensing and LiDAR Applications
Methods: Contrastive Language-Image Pre-training
