ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti

TL;DR
ImageBERT is a Transformer-based vision-language model pre-trained on large-scale weakly supervised image-text data, achieving state-of-the-art results on image and text retrieval tasks.
Contribution
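The paper introduces a multi-stage pre-training strategy for ImageBERT: the model is first pre-trained on LAIT, a large-scale weakly supervised image-text dataset collected from the Web, and then pre-trained a second time on Conceptual Captions and SBU Captions.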
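As a rough illustration of the multi-stage idea, the sketch below trains stage by stage, with each stage starting from the parameters left by the previous one. The dataset names and their order come from the abstract. The names `STAGES`, `run_multi_stage`, and `toy_train_on` are hypothetical, and the toy training function only records which datasets were seen; it is not the authors' training code.

```python
from typing import Callable, List, Tuple

# Stage order as described in the abstract; the structure itself is an assumption.
STAGES: List[Tuple[str, List[str]]] = [
    ("stage 1", ["LAIT"]),
    ("stage 2", ["Conceptual Captions", "SBU Captions"]),
]


def run_multi_stage(
    stages: List[Tuple[str, List[str]]],
    train_on: Callable[[list, List[str]], list],
    params: list,
) -> Tuple[list, List[str]]:
    """Run stages in order; each stage continues from the previous parameters."""
    history = []
    for name, datasets in stages:
        params = train_on(params, datasets)
        history.append(name)
    return params, history


def toy_train_on(params: list, datasets: List[str]) -> list:
    """Hypothetical stand-in for a real training loop: logs the datasets seen."""
    return params + list(datasets)


final_params, history = run_multi_stage(STAGES, toy_train_on, [])
print(final_params)
```

A single-stage baseline in this framing would be one entry in `STAGES` that lists all the datasets together.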
Findings
ImageBERT outperforms previous models on MSCOCO and Flickr30k image and text retrieval tasks.
Multi-stage pre-training yields better results than single-stage.
Effective use of large-scale weakly supervised data enhances model performance.
Abstract
In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them. The model is pre-trained on four tasks simultaneously: Masked Language Modeling (MLM), Masked Object Classification (MOC), Masked Region Feature Regression (MRFR), and Image Text Matching (ITM). To further enhance the pre-training quality, we have collected a Large-scale weAk-supervised Image-Text (LAIT) dataset from Web. We first pre-train the model on this dataset, then conduct a second stage pre-training on Conceptual Captions and SBU Captions. Our experiments show that multi-stage pre-training strategy outperforms single-stage pre-training. We also fine-tune and evaluate our pre-trained ImageBERT model on image retrieval and text retrieval tasks,…
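To make "pre-trained on four tasks simultaneously" concrete, here is a minimal sketch of a joint objective for one training example. It rests on assumptions the abstract does not state: the four task losses are summed with equal weight, MLM and MOC use cross-entropy, MRFR uses a squared-error loss, and ITM uses binary cross-entropy. All function names are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-probability of the correct class (used here for MLM and MOC)."""
    return -math.log(probs[target])


def binary_cross_entropy(p: float, label: int) -> float:
    """Binary cross-entropy for the matched / not-matched ITM decision."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared differences between region features (assumed MRFR loss)."""
    return sum((a - b) ** 2 for a, b in zip(pred, target))


def joint_loss(mlm_probs, mlm_target, moc_probs, moc_target,
               region_pred, region_true, itm_prob, itm_label) -> float:
    """Equal-weight sum of the four task losses (the weighting is an assumption)."""
    return (
        cross_entropy(mlm_probs, mlm_target)          # masked word prediction
        + cross_entropy(moc_probs, moc_target)        # masked object class
        + squared_error(region_pred, region_true)     # masked region feature
        + binary_cross_entropy(itm_prob, itm_label)   # image-text match
    )


# Toy inputs: uniform two-way predictions and a perfect region reconstruction.
total = joint_loss([0.5, 0.5], 0, [0.5, 0.5], 1, [1.0, 2.0], [1.0, 2.0], 0.5, 1)
print(total)
```

In a real model each term would be averaged over the masked positions in a batch and the gradients of the sum backpropagated through the shared Transformer; the scalar version above only shows how the four signals combine.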
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
