Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision

Tzu-Jui Julius Wang, Jorma Laaksonen, Tomas Langer, Heikki Arponen, and Tom E. Bishop

arXiv:2210.13591 · cs.CV · October 28, 2022

TL;DR

This paper introduces the Visual Vocabulary based Feature Hallucinator (WFH), which generates visual hallucinations from texts to improve weakly-supervised vision-language pre-training. It significantly improves cross-modal retrieval and benefits other downstream tasks without requiring paired image-caption data.

Contribution

The paper proposes WFH, a new method for weakly-supervised vision-language pre-training that generates visual features from texts, enabling better cross-modal alignment without paired image-caption data.

Findings

01. Consistently improves retrieval performance on the Flickr30K and MSCOCO datasets.

02. Enhances cross-dataset generalization by at least 14.5%.

03. Achieves results comparable to models trained with paired data in various downstream tasks.

Abstract

Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims at learning cross-modal alignment with little or no paired data, such as aligned images and captions. Recent W-VLP methods, which pair visual features with object tags, help achieve performances comparable with some VLP models trained with aligned pairs in various V-L downstream tasks. This, however, is not the case in cross-modal retrieval (XMR). We argue that the learning of such a W-VLP model is curbed and biased by the object tags of limited semantics. We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH), which is trained via weak supervision as a W-VLP model, not requiring images paired with captions. WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions…
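
The abstract (truncated above) does not spell out how WFH produces its hallucinations, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes, purely for illustration, that a hallucinated visual feature is a softmax-weighted mixture of entries in a visual vocabulary (a codebook of visual prototypes), with text token embeddings acting as queries. The function name `hallucinate_visual_features`, the random codebook, and all dimensions are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def hallucinate_visual_features(text_emb, visual_vocab):
    """Hypothetical hallucinator: attend from text tokens over a visual vocabulary.

    text_emb:     (T, d) text token embeddings (queries)
    visual_vocab: (K, d) visual prototypes (assumed codebook, not from the paper)
    returns:      (T, d) hallucinated visual features and (T, K) attention weights
    """
    d = text_emb.shape[-1]
    scores = text_emb @ visual_vocab.T / np.sqrt(d)  # (T, K) scaled dot products
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ visual_vocab, weights           # convex mix of prototypes


# Toy data: 5 text tokens, a 32-entry vocabulary, 16-dimensional features.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 16))
visual_vocab = rng.normal(size=(32, 16))

hallucinated, weights = hallucinate_visual_features(text_emb, visual_vocab)

# Pair each originally unpaired text embedding with its hallucinated visual
# feature, giving pseudo-pairs a cross-modal model could be trained on.
pseudo_pairs = list(zip(text_emb, hallucinated))
```

In this sketch every hallucinated feature is a convex combination of the vocabulary entries, so it stays within the span of the visual prototypes. Whether the actual WFH uses this form of attention is not stated in the text available here.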

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning