Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input

Qingpei Guo; Kaisheng Yao; Wei Chu

arXiv:2306.14182·cs.CV·June 27, 2023

TL;DR

Switch-BERT introduces a flexible, learnable multimodal interaction model that dynamically switches attention modes and layers, improving performance across vision-language tasks by mitigating modality mismatch.

Contribution

It extends BERT with learnable layer-wise and cross-layer interactions, enabling adaptive modeling of intra- and inter-modal relationships in multimodal learning.

Findings

01. Outperforms or matches the state of the art on VQA, image-text retrieval, and referring expression tasks.

02. Learns to attend to outputs from various depths, reducing modality mismatch.

03. Achieves superior task-specific multimodal interaction modeling.

Abstract

The ability to model intra-modal and inter-modal interactions is fundamental in multimodal machine learning. Current state-of-the-art models usually adopt deep learning architectures with fixed structures. They can achieve exceptional performance on specific tasks, but face a particularly challenging problem of modality mismatch because of the diversity of input modalities and their fixed structures. In this paper, we present Switch-BERT for joint vision and language representation learning to address this problem. Switch-BERT extends the BERT architecture by introducing learnable layer-wise and cross-layer interactions. It learns to optimize attention from a set of attention modes representing these interactions. One specific property of the model is that it learns to attend to outputs from various depths, thereby mitigating the modality mismatch problem. We present extensive experiments…
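The idea of choosing attention from a set of attention modes can be illustrated with a small sketch. This is not the authors' implementation. The three mode names ("intra", "inter", "joint"), the soft mixture over modes, and every function name below are assumptions made for illustration only; the abstract does not say how the modes are defined or how one is selected. The sketch builds a mask per mode over a mixed sequence of text and image tokens, then blends the masked attention maps using weights derived from learnable logits.

```python
import math

MODES = ["intra", "inter", "joint"]  # hypothetical mode set, not from the paper


def softmax(xs):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mode_mask(modality, mode):
    """Return an n x n 0/1 mask; modality[i] is a label such as 'txt' or 'img'."""
    n = len(modality)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same = modality[i] == modality[j]
            if mode == "joint" or (mode == "intra" and same) or (mode == "inter" and not same):
                mask[i][j] = 1
    return mask


def masked_attention(scores, mask):
    """Row-wise softmax restricted to positions where the mask is 1."""
    out = []
    for row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(row, mask_row) if m]
        if not allowed:  # no visible position: leave the row at zero
            out.append([0.0] * len(row))
            continue
        probs = iter(softmax(allowed))
        out.append([next(probs) if m else 0.0 for m in mask_row])
    return out


def switch_attention(scores, modality, mode_logits):
    """Blend per-mode attention maps with softmax weights over mode_logits.

    Assumption: a soft mixture stands in for whatever selection rule the
    paper actually uses.
    """
    weights = softmax(mode_logits)
    n = len(modality)
    mixed = [[0.0] * n for _ in range(n)]
    for weight, mode in zip(weights, MODES):
        attn = masked_attention(scores, mode_mask(modality, mode))
        for i in range(n):
            for j in range(n):
                mixed[i][j] += weight * attn[i][j]
    return mixed


# Two text tokens followed by one image token, with uniform raw scores.
modality = ["txt", "txt", "img"]
scores = [[0.0] * 3 for _ in range(3)]
mixed = switch_attention(scores, modality, mode_logits=[0.0, 0.0, 0.0])
```

In a trained model the mode logits would be parameters learned per layer, so that each layer could settle on the interaction pattern that suits the task. The cross-layer part of the abstract (attending to outputs from various depths) is not covered by this sketch.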

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Attention Dropout · WordPiece · Dense Connections · Adam · Residual Connection