LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

Anh-Quan Cao; Maximilian Jaritz; Matthieu Guillaumin; Raoul de Charette; Loris Bazzani

arXiv:2410.08211 · cs.CV · October 11, 2024

TL;DR

LatteCLIP is an unsupervised fine-tuning approach for CLIP models that uses LMM-generated texts to improve domain-specific classification without human annotations, outperforming zero-shot CLIP in top-1 accuracy across 10 domain-specific datasets.

Contribution

The paper introduces LatteCLIP, a novel unsupervised fine-tuning method leveraging LMM-generated descriptions and prototype learning to adapt CLIP to specific domains without annotations.

Findings

01. Outperforms zero-shot CLIP by an average of +4.74 points in top-1 accuracy.

02. Surpasses other unsupervised methods by +3.45 points on average.

03. Effective across 10 domain-specific datasets.

Abstract

Large-scale vision-language pre-trained (VLP) models (e.g., CLIP) are renowned for their versatility, as they can be applied to diverse applications in a zero-shot setup. However, when these models are used in specific domains, their performance often falls short due to domain gaps or the under-representation of these domains in the training data. While fine-tuning VLP models on custom datasets with human-annotated labels can address this issue, annotating even a small-scale dataset (e.g., 100k samples) can be an expensive endeavor, often requiring expert annotators if the task is complex. To address these challenges, we propose LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains, without relying on human annotations. Our method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for…
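The zero-shot setup with known class names mentioned in the abstract can be illustrated with a minimal sketch. It is not the LatteCLIP method, only the baseline idea of matching an image embedding against one text embedding per class name. The function names, the toy 3-dimensional embeddings, and the temperature value are all hypothetical; in practice the embeddings would come from a CLIP image and text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, class_text_embs, temperature=0.01):
    """Pick the class whose text embedding is most similar to the image embedding.

    image_emb:       list of floats (assumed output of an image encoder)
    class_text_embs: dict mapping class name -> list of floats
                     (assumed output of a text encoder on the class name)
    temperature:     hypothetical softmax temperature, chosen for illustration
    """
    img = l2_normalize(image_emb)
    names = list(class_text_embs)
    # Cosine similarity between the image and each class text embedding.
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(class_text_embs[name])))
        for name in names
    ]
    # Numerically stable softmax over temperature-scaled similarities.
    logits = [s / temperature for s in sims]
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(names, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings, made up for illustration only.
toy_text_embs = {
    "daisy": [0.9, 0.1, 0.0],
    "tulip": [0.1, 0.9, 0.0],
    "rose": [0.0, 0.2, 0.9],
}
toy_image_emb = [0.2, 0.8, 0.1]

label, probs = zero_shot_classify(toy_image_emb, toy_text_embs)
print(label)
```

No labeled images are needed for this step: the class names alone define the classifier, which is why a domain gap in the pre-trained embeddings translates directly into lower accuracy.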

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Contrastive Language-Image Pre-training