Leveraging per Image-Token Consistency for Vision-Language Pre-training

Yunhao Gou; Tom Ko; Hansi Yang; James Kwok; Yu Zhang; Mingxuan Wang

arXiv:2211.15398 · cs.CV · September 6, 2023 · 1 citation

Open Access

TL;DR

This paper introduces EPIC, a novel vision-language pre-training method that emphasizes image-token consistency, addressing limitations of existing approaches by focusing on salient tokens and their visual relevance, leading to improved downstream task performance.

Contribution

EPIC proposes a new masking and token consistency strategy that enhances vision-language alignment beyond traditional masked language modeling methods.

Findings

1. EPIC improves performance when combined with state-of-the-art pre-training methods.

2. EPIC effectively reduces modality bias in vision-language models.

3. Experimental results show significant gains on downstream tasks.

Abstract

Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose, based on two observations: (1) Modality bias: a considerable number of masked tokens in CMLM can be recovered from the language information alone, ignoring the visual inputs. (2) Under-utilization of the unmasked tokens: CMLM focuses primarily on the masked tokens and cannot simultaneously leverage the other tokens to learn vision-language associations. To address these limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent…
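
The masking-and-replacement step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that a per-token saliency score is already available (the abstract does not say how it is computed here), it stands in for language-model sampling with a hypothetical lookup table of alternatives, and it assumes that each token receives a binary label saying whether it still matches the original caption. The function name `corrupt_salient_tokens` and all parameters are illustrative.

```python
import random


def corrupt_salient_tokens(tokens, saliency, alternatives, mask_ratio=0.3, seed=0):
    """Replace the most image-salient tokens and label every token.

    tokens:       list of caption tokens (str)
    saliency:     hypothetical image-relevance score per token (float)
    alternatives: dict mapping a token to candidate replacements; an
                  assumed stand-in for sampling from a language model
    Returns (corrupted_tokens, labels), where label 1 means the token is
    unchanged and label 0 means it was replaced.
    """
    if len(tokens) != len(saliency):
        raise ValueError("tokens and saliency must have the same length")

    rng = random.Random(seed)
    n_mask = max(1, round(mask_ratio * len(tokens)))

    # Pick the positions with the highest saliency scores.
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    chosen = set(order[:n_mask])

    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        new_tok = tok
        if i in chosen:
            # Fall back to the original token if no alternative is known.
            new_tok = rng.choice(alternatives.get(tok, [tok]))
        corrupted.append(new_tok)
        # A sampled token that equals the original still counts as unchanged.
        labels.append(1 if new_tok == tok else 0)
    return corrupted, labels


if __name__ == "__main__":
    caption = ["a", "dog", "chases", "a", "red", "ball"]
    scores = [0.05, 0.9, 0.4, 0.05, 0.6, 0.8]  # made-up saliency values
    table = {"dog": ["cat"], "ball": ["kite"]}  # made-up alternatives
    print(corrupt_salient_tokens(caption, scores, table))
```

In this sketch every token, masked or not, ends up with a label, which is one way to read the abstract's point about using the unmasked tokens as well; how EPIC actually uses such labels is not stated in the truncated text above.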

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: ALBEF