ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, Yankui Sun

TL;DR
ViLTA is a vision-language pre-training method that improves model robustness through cross-distillation for Masked Language Modeling (MLM) and speeds up convergence through hard negative synthesis for Image-Text Matching (ITM), leading to improved downstream performance.
Contribution
The paper proposes ViLTA, a new method with two components that improve fine-grained representation learning and robustness in vision-language pre-training.
Findings
Achieves better performance on vision-language benchmarks.
Enhances model robustness via cross-distillation in MLM.
Speeds up convergence with hard negative synthesis in ITM.
Abstract
Vision-language pre-training (VLP) methods have blossomed recently. Their central goal is to jointly learn visual and textual features via a transformer-based architecture, and they have demonstrated promising improvements on a variety of vision-language tasks. Prior work usually focuses on how to align visual and textual features, but strategies for improving model robustness and speeding up model convergence remain insufficiently explored. In this paper, we propose a novel method, ViLTA, comprising two components that further help the model learn fine-grained representations among image-text pairs. For Masked Language Modeling (MLM), we propose a cross-distillation method that generates soft labels to enhance the robustness of the model, which alleviates the problem of treating synonyms of masked words as negative samples under one-hot labels. For Image-Text Matching (ITM), we leverage…
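The soft-label idea in the abstract can be illustrated with a small sketch. This is a generic soft-label distillation loss, not the paper's exact formulation: the function names, the mixing weight `alpha`, the `temperature` parameter, and the toy three-word vocabulary are all assumptions made for illustration. It shows why a one-hot target penalizes a student that prefers a synonym of the masked word, while a target blended with a teacher distribution penalizes it less.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_logits):
    """Cross-entropy between a target distribution and the student's softmax."""
    student_probs = softmax(student_logits)
    return -sum(
        t * math.log(p) for t, p in zip(target_probs, student_probs) if t > 0
    )


def soft_label_mlm_loss(student_logits, teacher_logits, gold_index,
                        alpha=0.5, temperature=1.0):
    """Hypothetical MLM loss for one masked position.

    The target is a blend of the one-hot gold label (weight alpha) and the
    teacher's softened distribution (weight 1 - alpha). alpha=1.0 recovers
    the usual one-hot MLM loss.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    target = [
        alpha * (1.0 if i == gold_index else 0.0) + (1.0 - alpha) * t
        for i, t in enumerate(teacher_probs)
    ]
    return cross_entropy(target, student_logits)


# Toy vocabulary (assumed): ["big", "large", "cat"]; the masked word is "big".
student_logits = [1.0, 2.0, -3.0]   # student prefers the synonym "large"
teacher_logits = [2.0, 2.0, -4.0]   # teacher treats "big" and "large" alike

hard_loss = soft_label_mlm_loss(student_logits, teacher_logits, 0, alpha=1.0)
soft_loss = soft_label_mlm_loss(student_logits, teacher_logits, 0, alpha=0.5)
```

In this toy case `soft_loss` is smaller than `hard_loss`, because the blended target gives probability mass to "large" instead of treating it as a pure negative.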
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: ALIGN · Focus
