Leveraging per Image-Token Consistency for Vision-Language Pre-training
Yunhao Gou, Tom Ko, Hansi Yang, James Kwok, Yu Zhang, Mingxuan Wang

TL;DR
This paper introduces EPIC, a vision-language pre-training method built on per image-token consistency. It targets two limitations of cross-modal masked language modeling (modality bias and under-utilization of unmasked tokens) by masking tokens that are salient to the image and replacing them with alternatives sampled from a language model, and it reports improved downstream task performance.
Contribution
EPIC pairs a saliency-based masking strategy with a per image-token consistency strategy to strengthen vision-language alignment beyond what cross-modal masked language modeling (CMLM) provides.
Findings
EPIC improves performance when combined with state-of-the-art pre-training methods.
EPIC effectively reduces modality bias in vision-language models.
Experimental results show significant gains on downstream tasks.
Abstract
Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable number of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Under-utilization of the unmasked tokens: CMLM primarily focuses on the masked tokens, but it cannot simultaneously leverage other tokens to learn vision-language associations. To handle these limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent…
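The masking-and-replacement step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The abstract excerpt is truncated, so how EPIC computes saliency and how it uses the replaced tokens are not specified here. The sketch assumes that per-token saliency scores are already available, that a caller-supplied `sample_alternative` function stands in for the language model, and that each token receives a label of 1 (unchanged) or 0 (replaced), a convention suggested only by the phrase "per image-token consistency". The names `saliency_mask_and_replace`, `sample_alternative`, `mask_ratio`, and the `[MASK]` placeholder are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def saliency_mask_and_replace(
    tokens: Sequence[str],
    saliency: Sequence[float],
    sample_alternative: Callable[[Sequence[str], int], str],
    mask_ratio: float = 0.3,
) -> Tuple[List[str], List[int]]:
    """Replace the most image-salient tokens with sampled alternatives.

    Returns the corrupted token list and per-token labels
    (assumed convention: 1 = token unchanged, 0 = token replaced).
    """
    if len(tokens) != len(saliency):
        raise ValueError("tokens and saliency must have the same length")
    if not tokens:
        return [], []

    # Pick the top-k most salient positions (at least one).
    n_mask = max(1, round(mask_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    chosen = sorted(ranked[:n_mask])

    # Mask all chosen positions first, so the sampler never sees the originals.
    masked = list(tokens)
    for i in chosen:
        masked[i] = "[MASK]"

    corrupted = list(tokens)
    labels = [1] * len(tokens)
    for i in chosen:
        # Hypothetical stand-in for sampling from a language model.
        replacement = sample_alternative(masked, i)
        corrupted[i] = replacement
        # A sample that happens to equal the original leaves the token unchanged.
        labels[i] = 1 if replacement == tokens[i] else 0
    return corrupted, labels


if __name__ == "__main__":
    sentence = ["a", "dog", "chases", "a", "red", "ball"]
    scores = [0.05, 0.9, 0.4, 0.05, 0.6, 0.8]  # made-up saliency values
    toy_sampler = lambda masked_tokens, position: "thing"
    print(saliency_mask_and_replace(sentence, scores, toy_sampler, mask_ratio=0.5))
```

In this toy setup the three highest-scoring tokens are replaced, and the labels mark which positions changed. A model trained against such labels would have to judge every token, not only the masked ones, which is the kind of signal the second observation in the abstract says CMLM lacks.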
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
Methods: ALBEF
