ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

Di Qi; Lin Su; Jia Song; Edward Cui; Taroon Bharti; Arun Sacheti

arXiv:2001.07966 · cs.CV · January 24, 2020 · 156 cites

TL;DR

ImageBERT is a Transformer-based vision-language model pre-trained on large-scale weakly supervised image-text data. After fine-tuning, it achieves state-of-the-art results on image retrieval and text retrieval on MSCOCO and Flickr30k.

Contribution

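The paper introduces ImageBERT together with a multi-stage pre-training strategy: the model is first pre-trained on LAIT, a large-scale weakly supervised image-text dataset collected from the Web, and then pre-trained a second time on Conceptual Captions and SBU Captions.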

Findings

01 Outperforms previous models on MSCOCO and Flickr30k retrieval tasks.

02 Multi-stage pre-training yields better results than single-stage.

03 Effective use of large-scale weakly supervised data enhances model performance.

Abstract

In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them. The model is pre-trained on four tasks simultaneously: Masked Language Modeling (MLM), Masked Object Classification (MOC), Masked Region Feature Regression (MRFR), and Image Text Matching (ITM). To further enhance the pre-training quality, we have collected a Large-scale weAk-supervised Image-Text (LAIT) dataset from the Web. We first pre-train the model on this dataset, then conduct a second stage of pre-training on Conceptual Captions and SBU Captions. Our experiments show that the multi-stage pre-training strategy outperforms single-stage pre-training. We also fine-tune and evaluate our pre-trained ImageBERT model on image retrieval and text retrieval tasks,…
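
The abstract says the four tasks (MLM, MOC, MRFR, ITM) are trained simultaneously. One way to picture that is a single objective that adds up one loss per task. The snippet below is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the equal task weights, the specific loss forms (cross-entropy for MLM and MOC, squared L2 for MRFR, binary cross-entropy for ITM), and the toy numbers are all illustrative choices made here.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target])

def l2_loss(predicted, target):
    """Squared L2 distance between a predicted and a target feature vector."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

def binary_cross_entropy(p_match, is_match):
    """Binary cross-entropy for a single matched/unmatched image-text pair."""
    return -math.log(p_match if is_match else 1.0 - p_match)

def joint_pretraining_loss(mlm, moc, mrfr, itm, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four task losses (equal weights are an assumption)."""
    return sum(w * loss for w, loss in zip(weights, (mlm, moc, mrfr, itm)))

# Toy values standing in for model outputs on one training example.
mlm = cross_entropy([0.1, 0.7, 0.2], target=1)   # masked word token
moc = cross_entropy([0.6, 0.4], target=0)        # masked object label
mrfr = l2_loss([0.5, 1.0], [0.0, 1.0])           # masked region feature
itm = binary_cross_entropy(0.9, is_match=True)   # image-text pair label
total = joint_pretraining_loss(mlm, moc, mrfr, itm)
print(f"joint loss: {total:.4f}")
```

In a real training loop each of the four terms would come from a separate prediction head on the shared Transformer, and the summed loss would be back-propagated once per batch.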

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques