LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
Anh-Quan Cao, Maximilian Jaritz, Matthieu Guillaumin, Raoul de Charette, Loris Bazzani

TL;DR
LatteCLIP is an unsupervised fine-tuning approach for CLIP models that uses LMM-generated texts to improve domain-specific classification without human annotations, achieving significant accuracy gains.
Contribution
The paper introduces LatteCLIP, a novel unsupervised fine-tuning method leveraging LMM-generated descriptions and prototype learning to adapt CLIP to specific domains without annotations.
Findings
Outperforms zero-shot CLIP by +4.74 percentage points in top-1 accuracy on average.
Surpasses other unsupervised methods by +3.45 percentage points on average.
Effective across 10 domain-specific datasets.
Abstract
Large-scale vision-language pre-trained (VLP) models (e.g., CLIP) are renowned for their versatility, as they can be applied to diverse applications in a zero-shot setup. However, when these models are used in specific domains, their performance often falls short due to domain gaps or the under-representation of these domains in the training data. While fine-tuning VLP models on custom datasets with human-annotated labels can address this issue, annotating even a small-scale dataset (e.g., 100k samples) can be an expensive endeavor, often requiring expert annotators if the task is complex. To address these challenges, we propose LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains, without relying on human annotations. Our method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for…
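The zero-shot setup with known class names that the abstract refers to can be illustrated with a minimal sketch. This is not the LatteCLIP method: it shows only the baseline idea of assigning each image to the class whose text embedding is most similar, and how top-1 accuracy is computed. The embeddings are random toy stand-ins for encoder outputs, and the names `zero_shot_predict`, `top1_accuracy`, and the class list are hypothetical, chosen for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale each vector to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_predict(image_emb, text_emb):
    # image_emb: (N, D) image embeddings; text_emb: (C, D) one embedding per class name.
    # Returns the index of the most similar class for each image, shape (N,).
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1)

def top1_accuracy(pred, labels):
    # Fraction of images whose top-ranked class matches the ground-truth label.
    return float((pred == labels).mean())

# Toy data (assumption): random vectors stand in for real image/text encoder outputs.
rng = np.random.default_rng(0)
class_names = ["daisy", "rose", "tulip"]           # hypothetical class names
text_emb = rng.normal(size=(len(class_names), 16))
labels = np.array([0, 1, 2, 1])
# Each toy image embedding is its class embedding plus small noise.
image_emb = text_emb[labels] + 0.01 * rng.normal(size=(len(labels), 16))

pred = zero_shot_predict(image_emb, text_emb)
acc = top1_accuracy(pred, labels)
print([class_names[i] for i in pred], acc)
```

In a real pipeline the two embedding matrices would come from a pre-trained image encoder and text encoder. Unsupervised fine-tuning aims to raise this top-1 accuracy on a target domain without using ground-truth labels for training.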
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Contrastive Language-Image Pre-training
