Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention
Ting Wu, Junjie Peng, Wenqiang Zhang, Huiran Zhang, Chuanshuai Ma, Yansong Huang

TL;DR
This paper introduces a multi-head attention based fusion network for multimodal sentiment analysis, effectively combining textual, visual, and acoustic signals to improve prediction accuracy and interpretability.
Contribution
It proposes a novel multi-head attention fusion network that models pairwise modality interactions with residual connections, enhancing sentiment analysis performance.
Findings
Outperforms existing methods on four public datasets
Effectively models pairwise modality interactions
Provides interpretability of bimodal contributions
Abstract
Humans express feelings or emotions via different channels. Take language as an example: it entails different sentiments under different visual-acoustic contexts. To precisely understand human intentions as well as reduce the misunderstandings caused by ambiguity and sarcasm, we should consider multimodal signals including textual, visual and acoustic signals. The crucial challenge is to fuse different modalities of features for sentiment analysis. To effectively fuse the information carried by different modalities and better predict the sentiments, we design a novel multi-head attention based fusion network, which is inspired by the observations that the interactions between any two pair-wise modalities are different and they do not equally contribute to the final sentiment prediction. By assigning the acoustic-visual, acoustic-textual and visual-textual features with reasonable…
Taxonomy
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention
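The building blocks listed above (linear layers, softmax, multi-head attention) can be combined into a rough sketch of attention over three bimodal feature vectors with a residual connection. This is only an illustration under stated assumptions, not the authors' architecture: the function names `multi_head_attention` and `fuse_bimodal`, the feature dimension, the number of heads, the random weights, and the way the three bimodal vectors are produced are all hypothetical placeholders. The returned attention weights show how much each bimodal feature attends to the others, which is one plausible way to inspect bimodal contributions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over the rows of x, shape (n, d)."""
    n, d = x.shape
    d_head = d // num_heads

    def project(w):
        # Linear layer, then split features into heads: (num_heads, n, d_head).
        return (x @ w).reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = project(w_q), project(w_k), project(w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (num_heads, n, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n, d)       # concatenate heads
    return merged @ w_o, weights


def fuse_bimodal(av, at, vt, params, num_heads=4):
    """Hypothetical fusion: attend across the three bimodal vectors."""
    x = np.stack([av, at, vt])                            # (3, d)
    attended, weights = multi_head_attention(x, *params, num_heads=num_heads)
    fused = x + attended                                  # residual connection
    return fused.reshape(-1), weights


# Placeholder inputs: random stand-ins for acoustic-visual, acoustic-textual
# and visual-textual features (assumed dimension d = 8).
rng = np.random.default_rng(0)
d = 8
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(4)]  # W_q, W_k, W_v, W_o
av, at, vt = (rng.normal(size=d) for _ in range(3))

fused, weights = fuse_bimodal(av, at, vt, params, num_heads=4)
print(fused.shape)           # fused representation for a downstream predictor
print(weights.mean(axis=0))  # head-averaged 3x3 attention between bimodal features
```

In a trained model the projection matrices would be learned and the fused vector passed to a prediction layer; here they are random, so the printed weights carry no meaning beyond showing the shapes involved.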
