Getting the subtext without the text: Scalable multimodal sentiment classification from visual and acoustic modalities
Nathaniel Blanchard, Daniel Moreira, Aparna Bharati, Walter J. Scheirer

TL;DR
This paper introduces a scalable multimodal sentiment classification model that uses only high-level visual and audio features, avoiding transcription to improve deployability and effectiveness in analyzing spoken sentiment.
Contribution
The paper presents a novel multimodal fusion approach that relies solely on high-level video and audio features, demonstrating its effectiveness without traditional transcription features.
Findings
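Achieved an F1 score of 0.8049 on the validation set
Achieved an F1 score of 0.6325 on the test set
Shows that high-level visual and audio features alone can detect sentiment, though the gap between validation and test scores suggests limited generalization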
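For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1 score is computed from predictions. The labels are toy values invented for illustration and are not the paper's data. The summary does not state whether the reported F1 is binary, macro-averaged, or weighted, so the binary form shown here is an assumption.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = positive sentiment, 0 = negative sentiment.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
score = f1_score(y_true, y_pred)
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which makes it a stricter summary than accuracy on imbalanced sentiment data.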
Abstract
In the last decade, video blogs (vlogs) have become an extremely popular method through which people express sentiment. The ubiquitousness of these videos has increased the importance of multimodal fusion models, which incorporate video and audio features with traditional text features for automatic sentiment detection. Multimodal fusion offers a unique opportunity to build models that learn from the full depth of expression available to human viewers. In the detection of sentiment in these videos, acoustic and video features provide clarity to otherwise ambiguous transcripts. In this paper, we present a multimodal fusion model that exclusively uses high-level video and audio features to analyze spoken sentences for sentiment. We discard traditional transcription features in order to minimize human intervention and to maximize the deployability of our model on at-scale real-world data.…
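The abstract does not describe the fusion mechanism, so the following is only a rough sketch of one common option, feature-level (early) fusion, in which per-sentence visual and acoustic feature vectors are concatenated before classification. The feature values, the `fuse` helper, and the nearest-centroid classifier are all hypothetical stand-ins chosen for brevity; they are not the authors' features or model.

```python
def fuse(visual, acoustic):
    """Early fusion: concatenate the two modality vectors into one."""
    return list(visual) + list(acoustic)


def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def predict(x, centroids):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: dist2(x, centroids[label]))


# Toy per-sentence features (invented): two visual and one acoustic value each.
visual = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
acoustic = [[0.7], [0.6], [0.2], [0.3]]
labels = ["positive", "positive", "negative", "negative"]

# Fuse the modalities, then compute one centroid per sentiment class.
fused = [fuse(v, a) for v, a in zip(visual, acoustic)]
centroids = {
    lab: centroid([f for f, l in zip(fused, labels) if l == lab])
    for lab in set(labels)
}

# Classify a new sentence from its visual and acoustic features only.
prediction = predict(fuse([0.75, 0.3], [0.65]), centroids)
```

Note that no transcript is involved at any step, which is the property the abstract emphasizes: the pipeline needs only features that can be extracted automatically from the video and audio streams.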
