MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition

Dan Song; Xinwei Fu; Ning Liu; Weizhi Nie; Wenhui Li; Lanjun Wang; You Yang; Anan Liu

arXiv:2311.18402 · cs.CV · September 12, 2024 · 1 citation

Open Access

TL;DR

This paper introduces MV-CLIP, a multi-view approach that enhances zero-shot 3D shape recognition by using view selection and hierarchical prompts to improve confidence and accuracy without additional training.

Contribution

The paper proposes a novel multi-view CLIP-based method with view selection and hierarchical prompts for improved zero-shot 3D shape recognition, achieving state-of-the-art results without extra training.

Findings

01 Achieves 84.44% accuracy on ModelNet40
02 Achieves 91.51% accuracy on ModelNet10
03 Achieves 66.17% accuracy on ShapeNet Core55

Abstract

Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios. Due to the lack of comparable pre-trained models for 3D shapes, recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition. However, due to the modality gap, pretrained language-image models are not confident enough in the generalization to 3D shape recognition. Consequently, this paper aims to improve the confidence with view selection and hierarchical prompts. Leveraging the CLIP model as an example, we employ view selection on the vision side by identifying views with high prediction confidence from multiple rendered views of a 3D shape. On the textual side, the strategy of hierarchical prompts is proposed for the first time. The first layer prompts several classification candidates with traditional class-level…
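The vision-side idea in the abstract, keeping only the rendered views whose predictions are most confident, can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: confidence is measured as low softmax entropy, the kept views are combined by averaging their class probabilities, and the per-view logits are hypothetical stand-ins for CLIP image-text similarity scores. The function names are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one view's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy; lower entropy is treated here as higher confidence (an assumption)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confident_views(view_logits, k):
    """Return the indices of the k views with the lowest prediction entropy."""
    probs = [softmax(logits) for logits in view_logits]
    ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]))
    return ranked[:k]


def classify_shape(view_logits, k):
    """Average class probabilities over the selected views (assumed aggregation)
    and return (predicted class index, indices of the views that were kept)."""
    keep = select_confident_views(view_logits, k)
    n_classes = len(view_logits[0])
    mean_probs = [
        sum(softmax(view_logits[i])[c] for i in keep) / len(keep)
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    return predicted, keep


# Hypothetical logits for 4 rendered views of one shape over 3 classes.
example_logits = [
    [5.0, 0.0, 0.0],   # confident in class 0
    [0.1, 0.0, 0.2],   # nearly uniform, low confidence
    [0.0, 4.0, 0.0],   # confident in class 1
    [6.0, 0.5, 0.0],   # most confident, class 0
]
prediction, kept_views = classify_shape(example_logits, k=2)
```

With these example values, the near-uniform view is discarded and the two sharpest views decide the prediction. The paper's actual confidence measure and aggregation rule may differ from this sketch.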

Taxonomy

Topics: 3D Surveying and Cultural Heritage · Industrial Vision Systems and Defect Detection · Medical Image Segmentation Techniques

Methods: Contrastive Language-Image Pre-training