Learning Visual Representations with Caption Annotations
Mert Bulent Sariyildiz, Julien Perez, Diane Larlus

TL;DR
This paper introduces image-conditioned masked language modeling (ICMLM), a pretraining method that learns visual representations from captioned images by predicting masked caption words from visual cues. The learned representations transfer to a variety of vision tasks.
Contribution
It proposes a new image-caption based pretraining approach, ICMLM, leveraging caption annotations to improve visual feature learning without extensive manual labeling.
Findings
Visual representations transfer well to multiple tasks
Caption data injects semantic information into visual features
ICMLM outperforms some existing pretraining methods
Abstract
Pretraining general-purpose visual features has become a crucial part of tackling many computer vision tasks. While one can learn such features on the extensively-annotated ImageNet dataset, recent approaches have looked at ways to allow for noisy, fewer, or even no annotations to perform such pretraining. Starting from the observation that captioned images are easily crawlable, we argue that this overlooked source of information can be exploited to supervise the training of visual representations. To do so, motivated by the recent progress in language models, we introduce image-conditioned masked language modeling (ICMLM), a proxy task to learn visual representations over image-caption pairs. ICMLM consists of predicting masked words in captions by relying on visual cues. To tackle this task, we propose hybrid models, with dedicated visual and textual encoders, and we show…
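The proxy task described in the abstract can be illustrated with a toy sketch. It is not the authors' model: the `[MASK]` token, the helper names `mask_caption` and `predict_masked_word`, the set of maskable words, and the dot-product scoring are all assumptions made for illustration. It only shows the shape of the problem: hide one caption word, then recover it from image features.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration


def mask_caption(tokens, maskable, rng):
    """Replace one maskable token with MASK.

    Returns (masked_tokens, position, target_word). The input list is not modified.
    """
    candidates = [i for i, tok in enumerate(tokens) if tok in maskable]
    if not candidates:
        raise ValueError("caption has no maskable token")
    pos = rng.choice(candidates)
    masked = list(tokens)
    target = masked[pos]
    masked[pos] = MASK
    return masked, pos, target


def predict_masked_word(image_features, word_embeddings):
    """Toy predictor: return the word whose embedding has the largest
    dot product with the image features (an assumed scoring rule)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return max(word_embeddings, key=lambda w: dot(image_features, word_embeddings[w]))


caption = ["a", "dog", "catches", "a", "frisbee"]
masked, pos, target = mask_caption(caption, {"dog", "frisbee"}, random.Random(0))

# Hand-made 2-d vectors standing in for a visual encoder's output
# and for learned word embeddings.
image_features = [1.0, 0.0]
word_embeddings = {"dog": [0.9, 0.1], "frisbee": [0.1, 0.9]}
guess = predict_masked_word(image_features, word_embeddings)
```

In an actual training setup the image features would come from the visual encoder being pretrained, and the loss on the masked word would be what pushes that encoder toward semantically meaningful features.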
