GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training

Jaeseok Byun, Taebaek Hwang, Jianlong Fu, and Taesup Moon

arXiv:2208.04060·cs.CV·August 9, 2022

TL;DR

GRIT-VLP introduces an adaptive mini-batch sampling strategy and enhanced masking techniques for more efficient vision-language pre-training, achieving state-of-the-art results with reduced computational costs.

Contribution

The paper proposes a novel grouped mini-batch sampling method and auxiliary loss to improve hard negative mining in vision-language pre-training, reducing training time and computational resources.

Findings

01 Achieves state-of-the-art performance on downstream tasks.

02 Trains for only one-third of the epochs used by previous models.

03 Demonstrates effective hard negative mining at lower computational cost.

Abstract

Most existing vision and language pre-training (VLP) methods have focused mainly on how to extract and align vision and text features. In contrast to these mainstream VLP methods, we highlight that two routinely applied steps during pre-training have a crucial impact on the performance of the pre-trained model: in-batch hard negative sampling for image-text matching (ITM) and assigning a large masking probability for masked language modeling (MLM). After empirically showing the unexpected effectiveness of these two steps, we systematically devise GRIT-VLP, which adaptively samples mini-batches for more effective mining of hard negative samples for ITM while maintaining the computational cost of pre-training. Our method consists of three components: 1) a GRouped mIni-baTch sampling (GRIT) strategy that collects similar examples in a mini-batch, 2) ITC consistency…
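The two ideas named in the abstract, collecting similar examples in a mini-batch and picking hard negatives for ITM from within that batch, can be illustrated with a minimal sketch. It assumes toy 2-D embeddings and cosine similarity; the function names (`group_similar`, `hard_negative_texts`) are hypothetical, and the greedy nearest-neighbour grouping is an illustrative stand-in, not the GRIT procedure from the paper or the official repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_similar(embeddings, batch_size):
    """Greedy grouping (an assumption for illustration): start from the first
    unused example, then keep appending the unused example most similar to
    the last one added until the batch is full."""
    unused = list(range(len(embeddings)))
    batches = []
    while unused:
        batch = [unused.pop(0)]
        while unused and len(batch) < batch_size:
            last = embeddings[batch[-1]]
            nxt = max(unused, key=lambda j: cosine(last, embeddings[j]))
            unused.remove(nxt)
            batch.append(nxt)
        batches.append(batch)
    return batches


def hard_negative_texts(image_embs, text_embs, batch, rng=None):
    """For each image in the batch, pick a non-matching text from the same batch.
    With rng=None the most similar wrong text is chosen; otherwise one is
    sampled with probability proportional to exp(similarity)."""
    negatives = {}
    for i in batch:
        candidates = [j for j in batch if j != i]
        if not candidates:
            continue  # a batch of one has no in-batch negatives
        sims = [cosine(image_embs[i], text_embs[j]) for j in candidates]
        if rng is None:
            negatives[i] = candidates[sims.index(max(sims))]
        else:
            weights = [math.exp(s) for s in sims]
            negatives[i] = rng.choices(candidates, weights=weights, k=1)[0]
    return negatives


# Toy data: image i is paired with text i; pairs (0, 2) and (1, 3) are near-duplicates.
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
text_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

batches = group_similar(image_embs, batch_size=2)
negatives = hard_negative_texts(image_embs, text_embs, batches[0])
```

In this toy setup the grouping places each example next to its near-duplicate, so the negative drawn inside the batch is the most confusable one available. With a randomly composed batch, the hardest in-batch negative could be far less similar, which is the motivation the abstract gives for grouping.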

Code & Models

Repositories

jaeseokbyun/grit-vlp (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: ALBEF · ALIGN