No Captions, No Problem: Captionless 3D-CLIP Alignment with Hard Negatives via CLIP Knowledge and LLMs

Cristian Sbrolli; Matteo Matteucci

arXiv:2406.02202·cs.CV·September 10, 2024

Open Access

TL;DR

This paper introduces unsupervised methods leveraging CLIP and LLMs to improve 3D-contrastive alignment without textual descriptions, achieving state-of-the-art results in 3D classification and retrieval tasks.

Contribution

The paper proposes two novel unsupervised methods, I2I and (I2L)^2, for 3D-contrastive learning that do not require textual annotations, utilizing CLIP knowledge and hard negative mining.
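The hard negative mining idea can be illustrated with a minimal sketch. It assumes, as a simplification not stated in this summary, that each 3D sample is represented by a CLIP image embedding of a rendered view and that similarity between two samples is the cosine similarity of those embeddings. The function names, the toy embeddings, and the top-k selection rule are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def mine_hard_negatives(embeddings: np.ndarray, k: int) -> np.ndarray:
    """For each sample, return the indices of its k most similar other samples.

    Assumption: "hard negative" here means "most similar non-identical sample";
    the paper's actual selection rule may differ.
    """
    sim = cosine_similarity_matrix(embeddings)
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own negative
    # Sort ascending, reverse to descending, keep the first k columns.
    return np.argsort(sim, axis=1)[:, ::-1][:, :k]


# Toy stand-ins for CLIP image embeddings of four rendered shapes (hypothetical).
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])
hard_negatives = mine_hard_negatives(toy_embeddings, k=1)
print(hard_negatives.tolist())
```

In this toy setup the first two samples point in nearly the same direction, as do the last two, so each sample's hardest negative is its near-duplicate.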

Findings

01

Achieves comparable or better 3D classification accuracy in zero-shot settings.

02

Significantly improves image-to-shape and shape-to-image retrieval performance.

03

Demonstrates effectiveness of unsupervised, captionless 3D alignment methods.

Abstract

In this study, we explore an alternative approach to enhance contrastive text-image-3D alignment in the absence of textual descriptions for 3D objects. We introduce two unsupervised methods, $I2I$ and $(I2L)^{2}$, which leverage CLIP knowledge about textual and 2D data to compute the neural perceived similarity between two 3D samples. We employ the proposed methods to mine 3D hard negatives, establishing a multimodal contrastive pipeline with hard negative weighting via a custom loss function. We train on different configurations of the proposed hard negative mining approach, and we evaluate the accuracy of our models in 3D classification and on the cross-modal retrieval benchmark, testing image-to-shape and shape-to-image retrieval. Results demonstrate that our approach, even without explicit text alignment, achieves comparable or superior performance on zero-shot and standard 3D…
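As a rough illustration of what "hard negative weighting" in a contrastive loss can look like, the sketch below up-weights selected negatives inside a standard InfoNCE denominator. This is a generic weighting scheme chosen for illustration, not the paper's custom loss; the function name, the `alpha` weight, the `temperature` default, and the boolean `hard_mask` input are all assumptions.

```python
import numpy as np


def weighted_info_nce(shape_emb, image_emb, hard_mask, alpha=1.0, temperature=0.07):
    """Shape-to-image InfoNCE with extra weight on flagged hard negatives.

    shape_emb, image_emb: (N, D) arrays; row i of each forms a positive pair.
    hard_mask: (N, N) boolean array; hard_mask[i, j] marks j as a hard
               negative for i (hypothetical input, e.g. from a mining step).
    """
    shape_emb = shape_emb / np.linalg.norm(shape_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = shape_emb @ image_emb.T / temperature

    # Weight 1 for ordinary pairs, 1 + alpha for hard negatives.
    weights = 1.0 + alpha * hard_mask.astype(float)
    np.fill_diagonal(weights, 1.0)  # positives are never re-weighted

    # Adding log(w) to a logit is the same as multiplying exp(logit) by w.
    weighted_logits = logits + np.log(weights)

    # Numerically stable log-sum-exp over each row.
    row_max = weighted_logits.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(
        np.exp(weighted_logits - row_max).sum(axis=1)
    )
    return float(np.mean(log_denominator - np.diag(logits)))


# Usage with random stand-in embeddings (hypothetical data).
rng = np.random.default_rng(0)
shapes = rng.normal(size=(4, 8))
images = rng.normal(size=(4, 8))
mask = np.zeros((4, 4), dtype=bool)
mask[0, 1] = True  # treat sample 1 as a hard negative for sample 0
loss = weighted_info_nce(shapes, images, mask, alpha=2.0)
```

With `alpha=0` the function reduces to plain InfoNCE; any positive `alpha` enlarges the denominator for flagged pairs, so the model is penalized more for scoring a hard negative close to the positive.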

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training