Self-Supervised Image Captioning with CLIP

Chuanyang Jin

arXiv:2306.15111 · cs.CV · November 3, 2023 · 1 citation

TL;DR

This paper presents a self-supervised image captioning approach that first learns an initial signal from a small labeled subset (less than 2% of COCO) and then uses CLIP to keep training on unlabeled images, reaching performance comparable to fully supervised models.

Contribution

Introduces a self-supervised image captioning method that reduces reliance on large labeled datasets by training on unlabeled images to increase the CLIP relevance between each image and its generated caption.

Findings

01 Achieves performance comparable to state-of-the-art models trained on the complete COCO dataset while using less than 2% of its labeled data.

02 Produces more distinctive and informative captions according to human evaluations.

03 Demonstrates effectiveness across standard benchmarks.

Abstract

Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for provided images. Current image captioning approaches heavily rely on high-quality image-caption pairs, which can be hard to obtain for many domains. To address this, we introduce a self-supervised image captioning method. After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data, leveraging the auxiliary task of enhancing the CLIP relevance between images and generated captions. Remarkably, despite utilizing less than 2% of the labeled COCO dataset, our method delivers a performance comparable to state-of-the-art models trained on the complete dataset. Human evaluations further reveal that our method produces captions with greater distinctiveness and informativeness, two attributes…
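
The self-supervised stage described in the abstract rewards captions that CLIP judges relevant to the image. The following is a minimal sketch of that idea, not the paper's implementation. It assumes the image and caption embeddings have already been produced by CLIP's encoders and are passed in as plain lists of floats. The function names `cosine_similarity` and `relevance_advantages` are hypothetical, and the mean-reward baseline is an illustrative choice that the abstract does not specify.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def relevance_advantages(
    image_emb: Sequence[float],
    caption_embs: Sequence[Sequence[float]],
) -> List[float]:
    """Score each candidate caption against the image and subtract the mean.

    Assumption: the relevance signal is the cosine similarity of the
    embeddings, and the baseline is the mean reward over the candidates.
    A positive value marks a caption that is more relevant than average.
    """
    rewards = [cosine_similarity(image_emb, c) for c in caption_embs]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Toy 2-D embeddings standing in for real CLIP outputs.
image = [1.0, 0.0]
captions = [[1.0, 0.0], [0.0, 1.0]]
advantages = relevance_advantages(image, captions)
```

In a training loop, values like these could weight the log-probabilities of sampled captions so that captions more relevant than average are reinforced. No labeled caption is needed to compute them, which is what allows this stage to run on unlabeled images.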

Taxonomy

Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training