MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition
Dan Song, Xinwei Fu, Ning Liu, Weizhi Nie, Wenhui Li, Lanjun Wang, You Yang, Anan Liu

TL;DR
This paper introduces MV-CLIP, a multi-view approach that enhances zero-shot 3D shape recognition by using view selection and hierarchical prompts to improve confidence and accuracy without additional training.
Contribution
The paper proposes a novel multi-view CLIP-based method with view selection and hierarchical prompts for improved zero-shot 3D shape recognition, achieving state-of-the-art results without extra training.
Findings
Achieves 84.44% accuracy on ModelNet40
Achieves 91.51% accuracy on ModelNet10
Achieves 66.17% accuracy on ShapeNet Core55
Abstract
Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios. Due to the lack of comparable pre-trained models for 3D shapes, recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition. However, due to the modality gap, pretrained language-image models are not confident enough in the generalization to 3D shape recognition. Consequently, this paper aims to improve the confidence with view selection and hierarchical prompts. Leveraging the CLIP model as an example, we employ view selection on the vision side by identifying views with high prediction confidence from multiple rendered views of a 3D shape. On the textual side, the strategy of hierarchical prompts is proposed for the first time. The first layer prompts several classification candidates with traditional class-level…
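The view-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only says that views with high prediction confidence are kept, so the use of softmax entropy as the confidence measure, the averaging of probabilities over the kept views, and all function names and example numbers below are assumptions made for illustration. The logits stand in for CLIP image-text similarity scores, which would be computed separately for each rendered view.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower entropy is treated as higher confidence (assumption)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confident_views(view_logits, k):
    """Return the indices of the k views with the lowest prediction entropy."""
    scored = [(entropy(softmax(logits)), i) for i, logits in enumerate(view_logits)]
    scored.sort()
    return [i for _, i in scored[:k]]


def classify_shape(view_logits, k):
    """Average class probabilities over the selected views (assumed aggregation rule)."""
    chosen = select_confident_views(view_logits, k)
    n_classes = len(view_logits[0])
    mean_probs = [
        sum(softmax(view_logits[i])[c] for i in chosen) / len(chosen)
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: mean_probs[c])
    return best, chosen


# Hypothetical similarity logits: 4 rendered views x 3 candidate classes.
example_logits = [
    [2.0, 1.9, 1.8],  # ambiguous view
    [6.0, 1.0, 0.5],  # confident view, favours class 0
    [1.0, 1.1, 0.9],  # ambiguous view
    [5.0, 0.5, 1.5],  # confident view, favours class 0
]
predicted, used_views = classify_shape(example_logits, k=2)
```

In this toy input, the two near-uniform views are discarded and the prediction rests on the two views whose score distributions are sharply peaked. The hierarchical-prompt step is not sketched here because the abstract text above is truncated before it is fully described.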
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Industrial Vision Systems and Defect Detection · Medical Image Segmentation Techniques
Methods: Contrastive Language-Image Pre-training
