Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos
Lianyang Ma, Yu Yao, Tao Liang, Tongliang Liu

TL;DR
This paper introduces a multi-scale cooperative transformer architecture for multimodal sentiment analysis in videos, leveraging multi-level semantic features for improved crossmodal interaction and robustness.
Contribution
It proposes a novel multi-scale cooperative transformer that exploits multi-level semantic features for better multimodal fusion in sentiment analysis.
Findings
Outperforms existing methods on unaligned multimodal sequences
Achieves strong performance on aligned multimodal sequences
Enhances robustness of multimodal sentiment analysis
Abstract
Multimodal sentiment analysis in videos is a key task in many real-world applications, which usually requires integrating multimodal streams including visual, verbal and acoustic behaviors. To improve the robustness of multimodal fusion, some existing methods let different modalities communicate with each other and model the crossmodal interaction via transformers. However, these methods use only single-scale representations during the interaction and fail to exploit multi-scale representations that contain different levels of semantic information. As a result, the representations learned by transformers could be biased, especially for unaligned multimodal data. In this paper, we propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis. On the whole, the "multi-scale" mechanism is capable of exploiting the different…
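To make the idea of crossmodal interaction over unaligned sequences more concrete, here is a minimal sketch. It is not the MCMulT architecture, whose details are not given in the abstract above. It shows generic single-head scaled dot-product attention in which queries come from one modality and keys/values from another, so the two sequences may have different lengths. The helper names (`crossmodal_attention`, `avg_pool`, `multiscale_crossmodal`), the use of temporal average pooling to simulate "scales", and the plain averaging of per-scale outputs are all assumptions made for illustration. Learned projections are omitted for brevity.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(target, source):
    """Attend from `target` (queries) to `source` (keys and values).

    target: list of T_t vectors, source: list of T_s vectors, same dimension d.
    T_t and T_s may differ, so no word-level alignment is required.
    """
    d = len(target[0])
    out = []
    for q in target:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in source]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, source)) for j in range(d)])
    return out


def avg_pool(seq, width):
    """Hypothetical 'coarser scale': average non-overlapping windows of `width` steps."""
    pooled = []
    for i in range(0, len(seq), width):
        window = seq[i:i + width]
        pooled.append([sum(col) / len(window) for col in zip(*window)])
    return pooled


def multiscale_crossmodal(target, source, widths=(1, 2, 4)):
    """Attend to several scales of `source` and average the results (an assumed fusion rule)."""
    d = len(target[0])
    per_scale = [crossmodal_attention(target, avg_pool(source, w)) for w in widths]
    return [
        [sum(s[t][j] for s in per_scale) / len(per_scale) for j in range(d)]
        for t in range(len(target))
    ]


random.seed(0)
text = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]    # 5 text steps
audio = [[random.gauss(0, 1) for _ in range(8)] for _ in range(12)]  # 12 audio steps
fused = multiscale_crossmodal(text, audio)
print(len(fused), len(fused[0]))  # 5 8
```

The output keeps the length of the query modality (5 steps here) regardless of the source length, which is why attention of this kind is a common choice when streams are sampled at different rates.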
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Emotion and Mood Recognition · Advanced Computing and Algorithms
