InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren Zhou, Hongxia Yang

TL;DR
InterBERT introduces a novel multimodal pretraining model that effectively models interactions between vision and language modalities, achieving strong performance on downstream tasks and including the first Chinese multi-modal pretrained model.
Contribution
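The paper presents InterBERT, the first model in the authors' M6 series, which models vision-and-language interaction with a single-stream interaction module followed by a two-stream module that keeps each modality independent. It also introduces a large-scale Chinese multi-modal dataset for pretraining.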
Findings
InterBERT outperforms recent multi-modal pretraining methods.
MSM and MRM tasks are effective for pretraining.
Chinese InterBERT achieves performance comparable to BERT in single-modal tasks.
Abstract
Multi-modal pretraining for learning high-level multi-modal representation is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model owns strong capability of modeling interaction between the information flows of different modalities. The single-stream interaction module is capable of effectively processing information of multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance downgrade in single-modal tasks. We pretrain the model with three pretraining tasks, including masked segment modeling (MSM), masked region modeling (MRM) and image-text matching (ITM); and finetune the model on a…
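The abstract names masked segment modeling (MSM) without giving its details here. A minimal sketch of the general idea follows, assuming MSM hides contiguous spans of text tokens rather than isolated ones. The function name `mask_segments`, the `mask_ratio` and `max_span` parameters, and their default values are hypothetical and not taken from the paper.

```python
import random

MASK = "[MASK]"  # placeholder token, as in BERT-style vocabularies


def mask_segments(tokens, mask_ratio=0.15, max_span=3, seed=None):
    """Mask contiguous spans until roughly mask_ratio of the tokens are hidden.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere. All defaults are illustrative
    assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    if not tokens:
        return masked, labels

    budget = max(1, round(len(tokens) * mask_ratio))  # tokens left to mask
    attempts = 0
    while budget > 0 and attempts < 10 * len(tokens):
        attempts += 1
        # Draw a span length, clamped to the remaining budget and the input.
        span = min(rng.randint(1, max_span), budget, len(tokens))
        start = rng.randrange(0, len(tokens) - span + 1)
        # Skip spans that overlap an already masked position.
        if any(labels[i] is not None for i in range(start, start + span)):
            continue
        for i in range(start, start + span):
            labels[i] = tokens[i]
            masked[i] = MASK
        budget -= span
    return masked, labels
```

Calling `mask_segments("a b c d e f g h i j".split(), mask_ratio=0.3, max_span=2, seed=0)` returns the token list with three positions replaced by `[MASK]`, plus the labels a model would be trained to recover. Masked region modeling (MRM) could be sketched the same way over image-region features instead of tokens, again as an assumption.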
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · InterBERT · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece
