Benchmarking Multimodal Sentiment Analysis
Erik Cambria, Devamanyu Hazarika, Soujanya Poria, Amir Hussain, R.B.V. Subramaanyam

TL;DR
This paper introduces a benchmark framework for multimodal sentiment analysis and emotion recognition that extracts CNN-based features from text and visual data and combines them with audio features, reporting a 10% performance improvement over the prior state of the art.
Contribution
It presents a comprehensive benchmark for multimodal sentiment analysis, emphasizing the importance of modality roles, speaker independence, and generalizability.
Findings
10% performance improvement over the state of the art
Highlights key issues like modality importance and speaker independence
Provides a new benchmark for future research
Abstract
We propose a framework for multimodal sentiment analysis and emotion recognition using convolutional neural network-based feature extraction from text and visual modalities. We obtain a performance improvement of 10% over the state of the art by combining visual, text and audio features. We also discuss some major issues frequently ignored in multimodal sentiment analysis research: the role of speaker-independent models, the importance of the modalities, and generalizability. The paper thus serves as a new benchmark for further research in multimodal sentiment analysis and also demonstrates the different facets of analysis to be considered while performing such tasks.
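The two ideas the abstract names, combining per-modality features and evaluating speaker-independent models, can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout `(speaker_id, features, label)`, the function names `fuse_features` and `speaker_independent_split`, and the use of plain concatenation for fusion are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def fuse_features(text_vec, audio_vec, visual_vec):
    """Feature-level fusion by concatenation (an assumed, simple fusion scheme)."""
    return list(text_vec) + list(audio_vec) + list(visual_vec)


def speaker_independent_split(records, test_fraction=0.2, seed=0):
    """Split (speaker_id, features, label) records so that no speaker
    appears in both the training set and the test set."""
    by_speaker = defaultdict(list)
    for record in records:
        by_speaker[record[0]].append(record)

    # Shuffle speakers, not utterances, so whole speakers are held out.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [r for s in speakers if s not in test_speakers for r in by_speaker[s]]
    test = [r for s in speakers if s in test_speakers for r in by_speaker[s]]
    return train, test


# Hypothetical toy data: 5 speakers with 3 utterances each.
records = [
    (f"spk{s}", fuse_features([0.1 * u], [0.2, 0.3], [0.4]), u % 2)
    for s in range(5)
    for u in range(3)
]
train, test = speaker_independent_split(records)
```

Splitting by speaker rather than by utterance keeps a model from scoring well merely by recognizing voices or faces it saw during training, which is the generalization concern the abstract raises.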
