Improving CLIP Training with Language Rewrites
Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, Yonglong Tian

TL;DR
This paper introduces LaCLIP, a method that enhances CLIP training by using language rewrites generated by large language models to diversify text descriptions, leading to significant improvements in transfer performance.
Contribution
The paper proposes a language augmentation technique for CLIP training based on language rewrites, which improves transfer accuracy without additional computation or memory overhead during training.
Findings
LaCLIP outperforms CLIP in zero-shot ImageNet accuracy by up to 8.2%.
Language rewrites increase diversity of text inputs, enhancing model robustness.
The method requires no extra computation or memory during training.
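The findings above can be illustrated with a minimal sketch of text-side augmentation. It assumes the rewrites have already been generated and stored next to each original caption, and that one text is drawn uniformly at random per training step. The record schema (`caption`, `rewrites`) and the function names are hypothetical, not taken from the paper's code.

```python
import random

def sample_caption(original, rewrites, rng):
    """Pick the text used for this training step: the original or one rewrite.

    Assumption: uniform sampling over the original plus all stored rewrites.
    """
    candidates = [original] + list(rewrites)
    return rng.choice(candidates)

def build_text_batch(batch, rng):
    """Draw one text per image from a batch of records.

    Each record is a dict with 'caption' and 'rewrites' keys (hypothetical schema).
    """
    return [sample_caption(item["caption"], item["rewrites"], rng) for item in batch]

# Toy data: captions and rewrites are made up for illustration only.
batch = [
    {
        "caption": "a dog on the beach",
        "rewrites": ["a puppy playing by the sea", "dog running along the shore"],
    },
    {"caption": "red car", "rewrites": []},
]

rng = random.Random(0)
texts = build_text_batch(batch, rng)
```

Under these assumptions the per-step cost is a list lookup, which is consistent with the claim that the method adds no computation or memory during training: the language model is only needed when the rewrites are produced, not inside the training loop.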
Abstract
Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained using contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcuts. However, in the CLIP training paradigm, data augmentations are exclusively applied to image inputs, while language inputs remain unchanged throughout the entire training process, limiting the exposure of diverse texts to the same image. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach to enhance CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text descriptions associated with each image. These rewritten texts exhibit diversity in sentence structure and…
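As background for the contrastive loss mentioned in the abstract, the following is a rough sketch of a symmetric image-text contrastive objective in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, embeddings are plain lists of floats, and the temperature is fixed at 0.07 rather than learned.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (stable log-sum-exp)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    combination in the batch acts as a negative.
    Assumption: fixed temperature, cosine similarity as the score.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # image-to-text: each row should peak at its own column
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # text-to-image: each column should peak at its own row
    loss_t2i = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy check: matched pairs should score a lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

In this framing, language augmentation only changes which text is encoded for a given image at each step; the loss itself is unchanged.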
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
