Learning Visual Representations with Caption Annotations

Mert Bulent Sariyildiz; Julien Perez; Diane Larlus

arXiv:2008.01392·cs.CV·August 5, 2020

TL;DR

This paper introduces ICMLM (image-conditioned masked language modeling), a pretraining method that learns visual representations from captioned images by predicting masked caption words from visual cues. The learned features transfer effectively to a variety of vision tasks.

Contribution

The paper proposes ICMLM, a new image-caption-based pretraining approach that uses caption annotations to improve visual feature learning without extensive manual labeling.

Findings

01. Visual representations transfer well to multiple tasks

02. Caption data injects semantic information into visual features

03. ICMLM outperforms some existing pretraining methods

Abstract

Pretraining general-purpose visual features has become a crucial part of tackling many computer vision tasks. While one can learn such features on the extensively annotated ImageNet dataset, recent approaches have looked at ways to allow for noisy, fewer, or even no annotations to perform such pretraining. Starting from the observation that captioned images are easily crawlable, we argue that this overlooked source of information can be exploited to supervise the training of visual representations. To do so, motivated by recent progress in language models, we introduce image-conditioned masked language modeling (ICMLM), a proxy task to learn visual representations over image-caption pairs. ICMLM consists of predicting masked words in captions by relying on visual cues. To tackle this task, we propose hybrid models, with dedicated visual and textual encoders, and we show…
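As a rough illustration of the proxy task described in the abstract, the sketch below masks one word of a caption and scores candidate words against an image feature, then computes a cross-entropy loss for the masked word. It is a toy under stated assumptions, not the paper's models: the `[MASK]` token, the function names, the three-word vocabulary, and the hand-written 2-d embeddings are all hypothetical, and the scorer uses only the image feature, whereas the paper describes hybrid models with dedicated visual and textual encoders.

```python
import math

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration


def mask_caption(tokens, positions):
    """Hide the tokens at `positions`; return the masked caption and the targets."""
    targets = {i: tokens[i] for i in positions}
    masked = [MASK if i in targets else tok for i, tok in enumerate(tokens)]
    return masked, targets


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_word_loss(image_feat, word_embs, vocab, target):
    """Cross-entropy of `target`, with each word scored by dot(image_feat, embedding).

    Simplification: the caption context is ignored; only the image is used.
    Returns (loss, predicted_word).
    """
    scores = [sum(a * b for a, b in zip(image_feat, word_embs[w])) for w in vocab]
    probs = softmax(scores)
    loss = -math.log(probs[vocab.index(target)])
    predicted = vocab[probs.index(max(probs))]
    return loss, predicted


# Toy data: every value below is made up for the example.
caption = ["a", "dog", "catches", "a", "frisbee"]
masked, targets = mask_caption(caption, [1])

vocab = ["dog", "cat", "frisbee"]
word_embs = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "frisbee": [0.5, 0.5]}
image_feat = [2.0, 0.0]  # stand-in for the output of a visual encoder

loss, predicted = masked_word_loss(image_feat, word_embs, vocab, targets[1])
print(masked, predicted, round(loss, 3))
```

In an actual training setup, the image feature would come from the visual encoder being pretrained, and the gradient of this loss is what would push semantic information from the caption into that encoder.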
