DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification

Sitian Shen; Zilin Zhu; Linqian Fan; Harry Zhang; Xinxiao Wu

arXiv:2305.15957 · cs.CV · May 7, 2024 · 1 cites

TL;DR

DiffCLIP introduces a framework that combines Stable Diffusion and style prompts to improve zero-shot 3D classification, bridging the domain gap between 2D pre-trained models and 3D data.

Contribution

The paper proposes DiffCLIP, which integrates Stable Diffusion with ControlNet and style prompts to improve 3D understanding and zero-shot classification performance.

Findings

01 Achieves 43.2% zero-shot accuracy on ScanObjectNN OBJ_BG

02 Attains 80.6% zero-shot accuracy on ModelNet10

03 Demonstrates strong 3D understanding on ModelNet10, ModelNet40, and ScanObjectNN

Abstract

Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning; among them, CLIP has achieved impressive results in image classification, object detection, and semantic segmentation. However, CLIP's performance on 3D point cloud processing tasks is limited by the domain gap between the depth maps obtained by projecting 3D data and the images CLIP was trained on. This paper proposes DiffCLIP, a new pre-training framework that incorporates Stable Diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using Stable Diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2%…
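
The zero-shot step that the abstract builds on can be illustrated with a minimal sketch of CLIP-style classification: each projected view of a point cloud is embedded, compared against one text embedding per class by cosine similarity, and the best-scoring class is chosen. This is not the authors' implementation. The function names, the toy embeddings, the averaging of scores across views, and the temperature of 0.01 are all assumptions made for illustration; in a real pipeline the embeddings would come from CLIP's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_scores(view_embedding, text_embeddings):
    """Cosine similarity between one view embedding and every class text embedding."""
    v = l2_normalize(view_embedding)
    return [sum(a * b for a, b in zip(v, l2_normalize(t))) for t in text_embeddings]


def softmax(scores, temperature=0.01):
    """Numerically stable softmax; the temperature is an assumed value."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(view_embeddings, text_embeddings, class_names):
    """Average per-view similarities (an assumption) and return the top class."""
    per_view = [cosine_scores(v, text_embeddings) for v in view_embeddings]
    mean_scores = [sum(col) / len(per_view) for col in zip(*per_view)]
    probs = softmax(mean_scores)
    best = max(range(len(class_names)), key=lambda i: probs[i])
    return class_names[best], probs


# Hypothetical 3-dimensional embeddings standing in for encoder outputs.
class_names = ["chair", "table"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
view_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]]

label, probs = zero_shot_classify(view_embeddings, text_embeddings, class_names)
print(label, probs)
```

No class-specific training happens in this sketch, which is why it counts as zero-shot: a new class only needs a new text embedding. The domain gap described in the abstract would show up here as view embeddings that sit far from every text embedding because depth maps look unlike the images CLIP was trained on.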

Taxonomy

Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Diffusion · Contrastive Language-Image Pre-training