ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation

Weihan Wang; Zhen Yang; Bin Xu; Juanzi Li; Yankui Sun

arXiv:2308.16689·cs.CV·September 1, 2023

TL;DR

ViLTA introduces a novel approach to vision-language pre-training that enhances model robustness and convergence speed through cross-distillation for MLM and hard negative synthesis for ITM, leading to improved performance.

Contribution

The paper proposes ViLTA, a new method with two components that improve fine-grained representation learning and robustness in vision-language pre-training.

Findings

01. Achieves better performance on vision-language benchmarks.

02. Enhances model robustness via cross-distillation in MLM.

03. Speeds up convergence with hard negative synthesis in ITM.

Abstract

Vision-language pre-training (VLP) methods have flourished recently. Their central goal is to jointly learn visual and textual features via a transformer-based architecture, and they have demonstrated promising improvements on a variety of vision-language tasks. Prior work usually focuses on how to align visual and textual features, but strategies for improving model robustness and speeding up convergence remain insufficiently explored. In this paper, we propose a novel method, ViLTA, comprising two components that help the model learn fine-grained representations among image-text pairs. For Masked Language Modeling (MLM), we propose a cross-distillation method that generates soft labels to enhance model robustness, which alleviates the problem of treating synonyms of masked words as negative samples under one-hot labels. For Image-Text Matching (ITM), we leverage…
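
The contrast between one-hot and soft labels for a masked token can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not spell out the cross-distillation procedure, so the "teacher" distribution below is a hypothetical stand-in for whatever model produces the soft labels, and the four-word vocabulary, the logits, and the function names are all invented for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_logits):
    """Cross-entropy between a target distribution and the student's prediction."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0.0)

# Hypothetical toy vocabulary; "puppy" acts as a near-synonym of the masked word "dog".
vocab = ["dog", "puppy", "car", "tree"]

# Assumed student scores for the masked position.
student_logits = [2.0, 1.8, -1.0, -1.5]

# One-hot label: every word except "dog" is treated as a negative, including "puppy".
one_hot = [1.0, 0.0, 0.0, 0.0]

# Soft label from an assumed teacher: some probability mass stays on "puppy".
teacher_logits = [2.5, 2.0, -2.0, -2.0]
soft_labels = softmax(teacher_logits)

hard_loss = cross_entropy(one_hot, student_logits)
soft_loss = cross_entropy(soft_labels, student_logits)

print("soft labels:", [round(p, 3) for p in soft_labels])
print("loss with one-hot target:", round(hard_loss, 3))
print("loss with soft target:  ", round(soft_loss, 3))
```

With the one-hot target, any probability the student assigns to "puppy" is counted as error. With the soft target, the loss is smallest when the student reproduces the teacher's distribution, so a plausible synonym is no longer pushed toward zero.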

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: ALIGN · Focus