Exchanging-based Multimodal Fusion with Transformer
Renyu Zhu, Chengcheng Han, Yong Qian, Qiushi Sun, Xiang Li, Ming Gao, Xuezhi Cao, Yunsen Xian

TL;DR
This paper introduces MuSE, a novel Transformer-based multimodal fusion model that exchanges embeddings between text and vision modalities, improving performance on multimodal tasks.
Contribution
The paper proposes MuSE, a new exchanging-based multimodal fusion model using Transformers, capable of handling sequential data and effectively capturing cross-modal correlations.
Findings
MuSE outperforms existing methods on named entity recognition (NER) and sentiment analysis tasks.
The model effectively exchanges information between text and vision modalities.
Experimental results demonstrate the superiority of MuSE over competitors.
Abstract
We study the problem of multimodal fusion in this paper. Recent exchanging-based methods have been proposed for vision-vision fusion, which aim to exchange embeddings learned from one modality to the other. However, most of them project inputs of multimodalities into different low-dimensional spaces and cannot be applied to the sequential input data. To solve these issues, in this paper, we propose a novel exchanging-based multimodal fusion model MuSE for text-vision fusion based on Transformer. We first use two encoders to separately map multimodal inputs into different low-dimensional spaces. Then we employ two decoders to regularize the embeddings and pull them into the same space. The two decoders capture the correlations between texts and images with the image captioning task and the text-to-image generation task, respectively. Further, based on the regularized embeddings, we…
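The exchanging idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before it describes the exchange step, so the selection rule used here (replace the lowest-scoring fraction of one modality's token embeddings with the mean embedding of the other modality) is an assumption, and the names exchange, mean_vector, scores, and ratio are hypothetical. The sketch also assumes both modalities already live in the same embedding space, which is the role the abstract assigns to the two decoders.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def exchange(target: Sequence[Vector], source: Sequence[Vector],
             scores: Sequence[float], ratio: float) -> List[Vector]:
    """Return a copy of `target` in which the lowest-scoring fraction of
    embeddings is replaced by the mean of the `source` embeddings.

    Assumption (not from the paper summary): positions are chosen by a
    per-token importance score, and the replacement is the source mean.
    """
    k = int(len(target) * ratio)  # number of positions to exchange
    lowest = set(sorted(range(len(target)), key=lambda i: scores[i])[:k])
    replacement = mean_vector(source)
    return [list(replacement) if i in lowest else list(vec)
            for i, vec in enumerate(target)]


# Toy embeddings, assumed to be already projected into a shared 2-d space.
text_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
image_emb = [[4.0, 4.0], [6.0, 2.0]]
text_scores = [0.9, 0.1, 0.5, 0.7]   # hypothetical importance scores
image_scores = [0.2, 0.8]

# Exchange in both directions: text receives image information and vice versa.
new_text = exchange(text_emb, image_emb, text_scores, ratio=0.25)
new_image = exchange(image_emb, text_emb, image_scores, ratio=0.5)
```

In this toy run the second text token (score 0.1) is overwritten with the image mean, and the first image embedding (score 0.2) is overwritten with the text mean. Exchanging in both directions only makes sense once the embeddings are comparable, which is why the shared space matters.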
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Softmax · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Byte Pair Encoding · Dropout
