Self-Supervised Image Captioning with CLIP
Chuanyang Jin

TL;DR
This paper presents a self-supervised image captioning approach that uses the CLIP relevance between images and generated captions as a training signal on unlabeled data, achieving performance comparable to fully supervised models with minimal labeled data.
Contribution
Introduces a novel self-supervised image captioning method that reduces reliance on large labeled datasets by utilizing CLIP for relevance enhancement.
Findings
Achieves performance comparable to state-of-the-art models trained on the complete COCO dataset while using less than 2% of its labeled data.
Produces more distinctive and informative captions according to human evaluations.
Demonstrates effectiveness across standard benchmarks.
Abstract
Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for provided images. Current image captioning approaches heavily rely on high-quality image-caption pairs, which can be hard to obtain for many domains. To address this, we introduce a self-supervised image captioning method. After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data, leveraging the auxiliary task of enhancing the CLIP relevance between images and generated captions. Remarkably, despite utilizing less than 2% of the labeled COCO dataset, our method delivers a performance comparable to state-of-the-art models trained on the complete dataset. Human evaluations further reveal that our method produces captions with greater distinctiveness and informativeness, two attributes…
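The self-supervised stage described in the abstract relies on a relevance score between an image and a generated caption. The sketch below illustrates one way such a score could be turned into a training signal. It is not the paper's implementation: it assumes the image and caption embeddings have already been produced by a CLIP-style encoder (here they are plain lists of floats), it uses cosine similarity as the relevance score, and it uses the mean score over sampled captions as a baseline. The function names `cosine_similarity` and `relevance_advantages` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as unrelated
    return dot / (norm_a * norm_b)


def relevance_advantages(
    image_emb: Sequence[float],
    caption_embs: Sequence[Sequence[float]],
) -> List[float]:
    """Score each sampled caption against the image and subtract the mean.

    Assumption: captions scoring above the mean relevance get a positive
    advantage and would be reinforced; those below get a negative one.
    """
    if not caption_embs:
        raise ValueError("need at least one caption embedding")
    rewards = [cosine_similarity(image_emb, c) for c in caption_embs]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real CLIP outputs.
    image = [1.0, 0.0]
    captions = [[1.0, 0.0], [0.0, 1.0]]
    print(relevance_advantages(image, captions))
```

In a real training loop, these advantages would weight the log-probabilities of the sampled captions so that the captioner is pushed toward captions more relevant to the image. No labeled caption is needed for this step, which is why it can run on unlabeled images.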
Taxonomy
Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
