Exchanging-based Multimodal Fusion with Transformer

Renyu Zhu; Chengcheng Han; Yong Qian; Qiushi Sun; Xiang Li; Ming Gao; Xuezhi Cao; Yunsen Xian

arXiv:2309.02190 · cs.CV · September 6, 2023 · 1 citation

TL;DR

This paper introduces MuSE, a novel Transformer-based multimodal fusion model that exchanges embeddings between text and vision modalities, improving performance on multimodal tasks.

Contribution

The paper proposes MuSE, a new exchanging-based multimodal fusion model using Transformers, capable of handling sequential data and effectively capturing cross-modal correlations.

Findings

01. MuSE outperforms existing methods on NER and sentiment analysis tasks.

02. The model effectively exchanges information between text and vision modalities.

03. Experimental results demonstrate the superiority of MuSE over competitors.

Abstract

We study the problem of multimodal fusion in this paper. Recent exchanging-based methods have been proposed for vision-vision fusion, which aim to exchange embeddings learned from one modality to the other. However, most of them project inputs of multimodalities into different low-dimensional spaces and cannot be applied to the sequential input data. To solve these issues, in this paper, we propose a novel exchanging-based multimodal fusion model MuSE for text-vision fusion based on Transformer. We first use two encoders to separately map multimodal inputs into different low-dimensional spaces. Then we employ two decoders to regularize the embeddings and pull them into the same space. The two decoders capture the correlations between texts and images with the image captioning task and the text-to-image generation task, respectively. Further, based on the regularized embeddings, we…
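The abstract is cut off before it describes the exchange step, so the following is only a minimal sketch of what an embedding exchange between two modalities could look like, not the authors' method. It assumes both modalities already live in the same embedding space (as the decoders are said to ensure), and it uses a hypothetical rule: the lowest-scoring fraction of tokens in one modality is replaced by the mean embedding of the other modality. The names `exchange`, `mean_vector`, `scores`, and `ratio` are illustrative and do not come from the paper or its repository.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def exchange(own: List[Vector], other: List[Vector],
             scores: List[float], ratio: float) -> List[Vector]:
    """Replace the lowest-scoring tokens of `own` with the mean of `other`.

    Assumption (hypothetical rule, not confirmed by the source): `scores`
    holds one importance value per token in `own`, and `ratio` is the
    fraction of tokens to overwrite.
    """
    k = int(len(own) * ratio)
    # Indices of the k tokens with the smallest importance scores.
    lowest = sorted(range(len(own)), key=lambda i: scores[i])[:k]
    replacement = mean_vector(other)
    result = [list(v) for v in own]  # copy so the inputs stay untouched
    for i in lowest:
        result[i] = list(replacement)
    return result


# Toy 2-dimensional embeddings: four text tokens, two image patches.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
image = [[2.0, 2.0], [4.0, 0.0]]
text_scores = [0.9, 0.1, 0.5, 0.3]
image_scores = [0.2, 0.8]

new_text = exchange(text, image, text_scores, ratio=0.5)
new_image = exchange(image, text, image_scores, ratio=0.5)
```

In this toy run, the two least important text tokens take on the image mean and the least important image patch takes on the text mean, so each modality carries some information from the other. How importance is scored and how much is exchanged are design choices that the truncated abstract does not specify.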

Code & Models

Repositories

recklessronan/muse
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Softmax · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Byte Pair Encoding · Dropout