CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

Aman Shrivastava; Ramprasaath R. Selvaraju; Nikhil Naik; Vicente Ordonez

arXiv:2112.07133·cs.CV·May 12, 2023·1 cites

TL;DR

CLIP-Lite is a contrastive method for learning visual representations from language supervision. By optimizing an information-efficient lower bound on the mutual information between images and text, it needs less data and smaller batch sizes than CLIP while outperforming it at the same scale on several visual recognition tasks.

Contribution

The paper presents a data-efficient contrastive learning framework that requires only one negative image-text pair for every positive pair, and that improves on CLIP at the same scale while training with less data and smaller batch sizes.

Findings

01. +14.0% mAP on Pascal VOC classification

02. +22.1% top-1 accuracy on ImageNet

03. Superior performance in image/text retrieval and zero-shot tasks

Abstract

We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on…
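
The one-negative-per-positive idea can be illustrated with a small sketch. The truncated abstract above does not name the lower bound, so the code below assumes a Jensen-Shannon-style mutual information estimator, which is one bound that works with a single negative pair per positive. The dot-product critic, the choice of the next caption in the batch as the negative, and the function names `softplus`, `dot`, and `jsd_mi_lower_bound` are all illustrative assumptions, not taken from the official repository.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(u, v):
    # Assumed critic: plain dot product between an image and a text feature.
    return sum(a * b for a, b in zip(u, v))


def jsd_mi_lower_bound(image_feats, text_feats):
    """Jensen-Shannon-style lower bound using ONE negative per positive.

    image_feats[i] and text_feats[i] are assumed to be a matched pair.
    The negative for image i is the caption of the next item in the batch
    (an assumption made for this sketch).
    """
    n = len(image_feats)
    if n < 2 or n != len(text_feats):
        raise ValueError("need at least two matched image-text pairs")

    total = 0.0
    for i in range(n):
        j = (i + 1) % n  # index of the single mismatched caption
        pos_score = dot(image_feats[i], text_feats[i])
        neg_score = dot(image_feats[i], text_feats[j])
        # Positive term rewards high scores on matched pairs,
        # negative term penalizes high scores on the mismatched pair.
        total += -softplus(-pos_score) - softplus(neg_score)
    return total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
misaligned_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned = jsd_mi_lower_bound(images, aligned_texts)
misaligned = jsd_mi_lower_bound(images, misaligned_texts)
print(aligned > misaligned)
```

In a training loop under these assumptions, the negative of this bound would be minimized with respect to the image and text encoder parameters. Each image contributes exactly two critic evaluations, so the cost per example does not grow with batch size.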

Code & Models

Repositories

4m4n5/CLIP-Lite · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Learning · Contrastive Language-Image Pre-training