TransModality: An End2End Fusion Method with Transformer for Multimodal Sentiment Analysis
Zilong Wang, Zhaohong Wan, and Xiaojun Wan

TL;DR
TransModality introduces an end-to-end Transformer-based fusion approach for multimodal sentiment analysis that captures subtle cross-modal correlations and achieves state-of-the-art results on the CMU-MOSI, MELD, and IEMOCAP datasets.
Contribution
The paper proposes a novel end-to-end Transformer-based fusion method, TransModality, for multimodal sentiment analysis, leveraging translation between modalities to improve joint representations.
Findings
Achieves state-of-the-art performance on the CMU-MOSI, MELD, and IEMOCAP datasets.
Demonstrates effectiveness of translation-based fusion in multimodal sentiment analysis.
Validates the model's superiority over existing fusion methods.
Abstract
Multimodal sentiment analysis is an important research area that predicts a speaker's sentiment tendency through features extracted from textual, visual and acoustic modalities. The central challenge is the fusion method of the multimodal information. A variety of fusion methods have been proposed, but few of them adopt end-to-end translation models to mine the subtle correlation between modalities. Enlightened by the recent success of the Transformer in the area of machine translation, we propose a new fusion method, TransModality, to address the task of multimodal sentiment analysis. We assume that translation between modalities contributes to a better joint representation of the speaker's utterance. With the Transformer, the learned features embody the information from both the source modality and the target modality. We validate our model on multiple multimodal datasets: CMU-MOSI, MELD, and IEMOCAP. The…
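As a rough illustration of how attention can let features from one modality absorb information from another, the sketch below applies single-head scaled dot-product attention with queries taken from a target modality and keys and values taken from a source modality. It is not the authors' implementation: the function names, the toy feature values, the absence of learned projections, and the use of concatenation as the fusion step are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(target_seq, source_seq):
    """Hypothetical helper: each target-modality vector attends over the
    source-modality sequence, and the resulting context vector is
    concatenated to it. No learned weights are used (an assumption)."""
    d = len(source_seq[0])
    fused = []
    for query in target_seq:
        # Scaled dot-product score between the query and every source vector.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in source_seq
        ]
        weights = softmax(scores)
        # Context: attention-weighted average of the source vectors.
        context = [
            sum(w * vec[j] for w, vec in zip(weights, source_seq))
            for j in range(d)
        ]
        # Fuse by concatenating the target features with the source context.
        fused.append(list(query) + context)
    return fused


# Toy features (made up): 2 text steps and 3 audio steps, both 2-dimensional.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
audio_feats = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

fused = cross_modal_attention(text_feats, audio_feats)
for row in fused:
    print([round(x, 3) for x in row])
```

Each fused row keeps the original text features in its first two entries and carries an audio-derived context in the last two, which is one simple way a representation can hold information from both a source and a target modality. A full Transformer would add learned projections, multiple heads, residual connections, and layer normalization.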
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Music and Audio Processing · Emotion and Mood Recognition
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Dropout · Dense Connections · Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Multi-Head Attention
