On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis
Atsushi Ando, Ryo Masumura, Akihiko Takashima, Satoshi Suzuki, Naoki Makishima, Keita Suzuki, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato

TL;DR
This study evaluates the use of large-scale, modality-specific pre-trained encoders for multimodal sentiment analysis, demonstrating their superiority over traditional features and highlighting the benefits of using intermediate layer outputs.
Contribution
It is the first comprehensive comparison of large-scale pre-trained encoders across visual, acoustic, and linguistic modalities in multimodal sentiment analysis.
Findings
Pre-trained encoders outperform conventional features in unimodal and multimodal tasks.
Using intermediate layer outputs yields better performance than final layer outputs.
Domain-specific pre-trained encoders enhance sentiment analysis accuracy.
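As a rough illustration of the second finding, the sketch below contrasts taking an utterance-level feature from the final encoder layer with taking it from an intermediate layer. It uses toy numbers in place of real encoder outputs, and every name in it (`mean_pool`, `select_layer`, `weighted_layer_sum`, `hidden_states`) is hypothetical. The weighted combination of layers is one common way to exploit intermediate outputs and is shown here as an assumption, not as the procedure confirmed by this summary.

```python
from typing import List, Sequence

Frame = List[float]   # one feature vector (a single time step or token)
Layer = List[Frame]   # all frames produced by one encoder layer


def mean_pool(layer: Layer) -> Frame:
    """Average the frame vectors over time into one utterance-level vector."""
    n_frames = len(layer)
    dim = len(layer[0])
    return [sum(frame[d] for frame in layer) / n_frames for d in range(dim)]


def select_layer(hidden_states: Sequence[Layer], index: int) -> Frame:
    """Pool the output of a single layer (index -1 is the final layer)."""
    return mean_pool(hidden_states[index])


def weighted_layer_sum(hidden_states: Sequence[Layer],
                       weights: Sequence[float]) -> Frame:
    """Combine pooled layer outputs with normalized weights (an assumption)."""
    if len(weights) != len(hidden_states):
        raise ValueError("need exactly one weight per layer")
    total = sum(weights)
    pooled = [mean_pool(layer) for layer in hidden_states]
    dim = len(pooled[0])
    return [
        sum((w / total) * p[d] for w, p in zip(weights, pooled))
        for d in range(dim)
    ]


# Toy stand-in for per-layer encoder outputs: 3 layers, 2 frames, 2 dimensions.
hidden_states = [
    [[0.0, 0.0], [2.0, 4.0]],  # layer 0
    [[1.0, 1.0], [3.0, 5.0]],  # layer 1 (intermediate)
    [[4.0, 0.0], [0.0, 4.0]],  # layer 2 (final)
]

final_feat = select_layer(hidden_states, -1)
mid_feat = select_layer(hidden_states, 1)
mixed_feat = weighted_layer_sum(hidden_states, [1.0, 2.0, 1.0])
```

In practice the per-layer outputs would come from a pre-trained encoder, and the layer index or layer weights would be chosen on validation data or learned with the downstream model.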
Abstract
This paper investigates the effectiveness and implementation of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis (MSA). Although the effectiveness of pre-trained encoders has been reported in various fields, conventional MSA methods employ them only for the linguistic modality, and their application to the other modalities has not been investigated. This paper compares the features yielded by large-scale pre-trained encoders with conventional heuristic features. For each modality, one of the largest publicly available pre-trained encoders is used: CLIP-ViT, WavLM, and BERT for the visual, acoustic, and linguistic modalities, respectively. Experiments on two datasets reveal that methods with domain-specific pre-trained encoders attain better performance than those with conventional features in both unimodal and multimodal scenarios. We also find it better to use the outputs of…
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis · Emotion and Mood Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Dense Connections · Linear Layer · Layer Normalization · Residual Connection
