GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training
Jaeseok Byun, Taebaek Hwang, Jianlong Fu, and Taesup Moon

TL;DR
GRIT-VLP introduces an adaptive mini-batch sampling strategy and enhanced masking techniques for more efficient vision-language pre-training, achieving state-of-the-art results with reduced computational costs.
Contribution
The paper proposes a grouped mini-batch sampling method and an auxiliary loss that improve hard negative mining in vision-language pre-training while reducing training time and computational cost.
Findings
Achieves state-of-the-art performance on downstream tasks.
Requires only one-third of the training epochs used by previous models.
Demonstrates effective hard negative mining with lower computational cost.
Abstract
Most of the currently existing vision and language pre-training (VLP) methods have mainly focused on how to extract and align vision and text features. In contrast to the mainstream VLP methods, we highlight that two routinely applied steps during pre-training have crucial impact on the performance of the pre-trained model: in-batch hard negative sampling for image-text matching (ITM) and assigning the large masking probability for the masked language modeling (MLM). After empirically showing the unexpected effectiveness of above two steps, we systematically devise our GRIT-VLP, which adaptively samples mini-batches for more effective mining of hard negative samples for ITM while maintaining the computational cost for pre-training. Our method consists of three components: 1) GRouped mIni-baTch sampling (GRIT) strategy that collects similar examples in a mini-batch, 2) ITC consistency…
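The two ideas named in the abstract, collecting similar examples in a mini-batch and picking in-batch hard negatives for ITM, can be illustrated with a small NumPy sketch. This is not the authors' procedure: the greedy nearest-neighbor ordering, the function names `group_into_batches` and `hardest_in_batch_negatives`, and the use of a deterministic argmax (rather than sampling from a similarity-weighted distribution) are all assumptions made for illustration.

```python
import numpy as np


def group_into_batches(sim, batch_size, start=0):
    """Greedy grouping (illustrative assumption, not the paper's algorithm).

    Walks through the examples by always moving to the unvisited example
    most similar to the last one picked, then cuts the resulting order
    into consecutive mini-batches so that each batch holds similar items.
    """
    n = sim.shape[0]
    visited = np.zeros(n, dtype=bool)
    order = [start]
    visited[start] = True
    for _ in range(n - 1):
        # Exclude already-visited examples from the search.
        scores = np.where(visited, -np.inf, sim[order[-1]])
        nxt = int(np.argmax(scores))
        order.append(nxt)
        visited[nxt] = True
    return [order[i:i + batch_size] for i in range(0, n, batch_size)]


def hardest_in_batch_negatives(img_emb, txt_emb):
    """For each image, return the index of the most similar non-matching text.

    Assumes row i of img_emb and row i of txt_emb form a matched pair,
    so the diagonal of the similarity matrix holds the positives.
    """
    sim = img_emb @ txt_emb.T
    np.fill_diagonal(sim, -np.inf)  # mask out the positive pairs
    return sim.argmax(axis=1)


# Toy data: two loose clusters of 2-D embeddings (hypothetical values).
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
batches = group_into_batches(emb @ emb.T, batch_size=2)
negatives = hardest_in_batch_negatives(emb, emb)
```

When a batch already contains similar examples, the hardest in-batch negative is closer to the positive than it would be in a randomly shuffled batch, which is the intuition behind grouping before mining.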
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
Methods: ALBEF · ALIGN
