On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis

Atsushi Ando; Ryo Masumura; Akihiko Takashima; Satoshi Suzuki; Naoki Makishima; Keita Suzuki; Takafumi Moriya; Takanori Ashihara; Hiroshi Sato

arXiv:2210.15937·cs.CL·October 31, 2022

TL;DR

This study evaluates the use of large-scale, modality-specific pre-trained encoders for multimodal sentiment analysis, demonstrating their superiority over traditional features and highlighting the benefits of using intermediate layer outputs.

Contribution

It is the first comprehensive comparison of large-scale pre-trained encoders across visual, acoustic, and linguistic modalities in multimodal sentiment analysis.

Findings

01

Pre-trained encoders outperform conventional features in unimodal and multimodal tasks.

02

Using intermediate layer outputs yields better performance than final layer outputs.

03

Domain-specific pre-trained encoders enhance sentiment analysis accuracy.
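
Finding 02 states that intermediate-layer outputs work better than final-layer outputs. One common way to use intermediate layers is a softmax-weighted sum over the per-layer hidden states. This page does not say how the authors combine layers, so the sketch below is an illustrative assumption rather than their method. The names `softmax` and `weighted_layer_sum` are hypothetical, and plain Python lists stand in for real encoder outputs.

```python
import math


def softmax(scores):
    """Turn unnormalized per-layer scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_outputs, layer_scores):
    """Combine hidden states from several encoder layers.

    layer_outputs: list of layers; each layer is a list of frames;
                   each frame is a list of floats (the feature vector).
    layer_scores:  one score per layer (learned in a real model; fixed here).
    Returns one feature sequence with the same frame count and dimension.
    """
    if len(layer_outputs) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    combined = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                combined[t][d] += weight * layer[t][d]
    return combined


# Toy input (assumed): 2 layers, 1 frame, 2-dimensional features.
toy_layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
# Equal scores give equal weights, so the result is the per-layer average.
averaged = weighted_layer_sum(toy_layers, [0.0, 0.0])
```

Using only the final layer is the special case in which the last layer's score is much larger than all the others, so a downstream model that learns the scores can recover that baseline if it is the better choice.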

Abstract

This paper investigates the effectiveness and implementation of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis (MSA). Although the effectiveness of pre-trained encoders has been reported in various fields, conventional MSA methods employ them only for the linguistic modality, and their application to the other modalities has not been investigated. This paper compares the features yielded by large-scale pre-trained encoders with conventional heuristic features. One of the largest publicly available pre-trained encoders is used for each modality: CLIP-ViT, WavLM, and BERT for the visual, acoustic, and linguistic modalities, respectively. Experiments on two datasets reveal that methods with domain-specific pre-trained encoders attain better performance than those with conventional features in both unimodal and multimodal scenarios. We also find it better to use the outputs of…

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis · Emotion and Mood Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Dense Connections · Linear Layer · Layer Normalization · Residual Connection