CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision
Aman Shrivastava, Ramprasaath R. Selvaraju, Nikhil Naik, Vicente Ordonez

TL;DR
CLIP-Lite introduces an efficient contrastive learning method that reduces data and batch size requirements while outperforming CLIP in various visual recognition tasks by leveraging an information-theoretic approach with language supervision.
Contribution
The paper presents a data-efficient contrastive learning framework that needs only one negative image-text pair per positive pair, and that outperforms CLIP at the same scale while training on less data and with smaller batch sizes.
Findings
+14.0% mAP (absolute gain) on Pascal VOC classification
+22.1% top-1 accuracy gain on ImageNet
Superior performance on image and text retrieval and on zero-shot tasks
Abstract
We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on…
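The one-negative-per-positive objective described in the abstract can be illustrated with a small sketch. The abstract does not name the bound it uses, so the code below assumes a Jensen-Shannon-style mutual information estimator, which is one common choice that needs only a single mismatched pair per matched pair. The function names, the dot-product score function, and the way negatives are sampled are all illustrative assumptions, not the authors' implementation.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(u, v):
    # Hypothetical score function T(image, text): a plain dot product.
    return sum(a * b for a, b in zip(u, v))


def jsd_mi_lower_bound(image_feats, text_feats, score=dot, rng=None):
    """Jensen-Shannon-style MI estimate using ONE negative per positive.

    Assumption: image_feats[i] and text_feats[i] form a matched pair.
    For each image, a single mismatched caption is drawn from the batch.
    """
    rng = rng or random.Random(0)
    n = len(image_feats)
    if n < 2:
        raise ValueError("need at least two pairs to draw a negative")
    pos_term, neg_term = 0.0, 0.0
    for i in range(n):
        # Pick one index j != i uniformly at random.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        pos_term += -softplus(-score(image_feats[i], text_feats[i]))
        neg_term += softplus(score(image_feats[i], text_feats[j]))
    return (pos_term - neg_term) / n


def contrastive_loss(image_feats, text_feats):
    # Maximizing the bound is the same as minimizing its negative.
    return -jsd_mi_lower_bound(image_feats, text_feats)


if __name__ == "__main__":
    images = [[5.0, 0.0], [0.0, 5.0]]
    captions = [[5.0, 0.0], [0.0, 5.0]]
    print("aligned pairs:  ", jsd_mi_lower_bound(images, captions))
    print("shuffled pairs: ", jsd_mi_lower_bound(images, captions[::-1]))
```

In this toy setup the estimate is higher when each image is paired with its own caption than when the captions are shuffled, which is the behavior a training loop would push the encoders toward. Note that the cost per step grows linearly with the batch, since each positive is compared against one negative rather than against every other item in the batch.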
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Learning · Contrastive Language-Image Pre-training
