InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

Junyang Lin; An Yang; Yichang Zhang; Jie Liu; Jingren Zhou; Hongxia Yang

arXiv:2003.13198 · cs.CL · April 23, 2021 · 56 cites

TL;DR

InterBERT is a multi-modal pretraining model designed to model the interaction between vision and language. It performs strongly on downstream tasks, and the work also includes the first Chinese multi-modal pretrained model.

Contribution

The paper presents InterBERT, the first model in the authors' M6 series of multi-modal pretraining methods. It pairs a single-stream interaction module with a two-stream module on top, and it introduces a large-scale Chinese multi-modal dataset for pretraining.

Findings

01. InterBERT outperforms recent multi-modal pretraining methods.

02. The masked segment modeling (MSM) and masked region modeling (MRM) tasks are effective for pretraining.

03. Chinese InterBERT achieves performance comparable to BERT on single-modal tasks.

Abstract

Multi-modal pretraining for learning high-level multi-modal representation is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model has a strong capability for modeling interaction between the information flows of different modalities. The single-stream interaction module is capable of effectively processing information from multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance degradation in single-modal tasks. We pretrain the model with three pretraining tasks, including masked segment modeling (MSM), masked region modeling (MRM) and image-text matching (ITM); and finetune the model on a…
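As a rough illustration of the masked segment modeling (MSM) task named in the abstract, the sketch below masks one contiguous run of text tokens instead of isolated tokens. The abstract shown here does not say how segments are chosen, so the fixed segment length, the uniform choice of start position, the `[MASK]` placeholder, and the helper name `mask_segment` are all assumptions made for illustration, not the authors' implementation.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, assumed for this sketch


def mask_segment(tokens, seg_len, rng):
    """Replace one contiguous run of `seg_len` tokens with MASK.

    Returns (masked_tokens, labels). `labels` holds the original token at
    each masked position and None everywhere else, so a model would only
    be asked to predict the masked span.
    """
    if not 0 < seg_len <= len(tokens):
        raise ValueError("seg_len must be between 1 and len(tokens)")
    # Assumption: the segment start is drawn uniformly at random.
    start = rng.randrange(len(tokens) - seg_len + 1)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in range(start, start + seg_len):
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


tokens = "a man rides a red bike".split()
masked, labels = mask_segment(tokens, seg_len=2, rng=random.Random(0))
print(masked)
print(labels)
```

Masking a whole span removes the neighbouring words that would otherwise make each masked token easy to guess from text alone. Under the assumptions above, that leaves the image regions as a more useful source of evidence for the prediction.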

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · InterBERT · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece