Multimodal Fusion with Deep Neural Networks for Audio-Video Emotion Recognition
Juan D. S. Ortega, Mohammed Senoussaoui, Eric Granger, Marco Pedersoli, Patrick Cardinal, Alessandro L. Koerich

TL;DR
This paper introduces a deep neural network architecture for multimodal emotion recognition that fuses audio, video, and text data, and that outperforms early and late fusion baselines on the AVEC Sentiment Analysis in the Wild dataset.
Contribution
The paper proposes a new DNN architecture with independent and shared layers for multimodal fusion, improving emotion prediction accuracy over traditional fusion approaches.
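As a rough illustration of "independent and shared layers", the following minimal NumPy sketch runs a forward pass through one independent layer per modality, followed by a shared layer over the concatenated representations. The feature sizes, hidden width, tanh activation, random weights, and single regression output are all assumptions made for the example; they are not taken from the paper, and no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def init(n_in, n_out):
    """Return small random weights and zero biases for one dense layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def dense(x, w, b):
    """Dense layer with tanh activation (activation choice is an assumption)."""
    return np.tanh(x @ w + b)

# Hypothetical per-modality feature sizes and hidden width.
dims = {"audio": 20, "video": 30, "text": 10}
hidden = 8

# Independent layers: one set of weights per modality.
branch = {m: init(d, hidden) for m, d in dims.items()}
# Shared layer: operates on the concatenated modality representations.
shared = init(hidden * len(dims), hidden)
# Linear regression head for one emotional dimension (e.g. arousal).
out_w, out_b = init(hidden, 1)

def forward(inputs):
    reps = [dense(inputs[m], *branch[m]) for m in dims]  # per-modality representations
    fused = np.concatenate(reps, axis=1)                 # combine modalities
    h = dense(fused, *shared)                            # joint representation
    return h @ out_w + out_b                             # one prediction per sample

batch = {m: rng.normal(size=(4, d)) for m, d in dims.items()}
preds = forward(batch)
print(preds.shape)
```

In this sketch the fusion happens inside the network, between the modality-specific layers and the output, rather than on raw features or on final scores.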
Findings
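Achieved a higher CCC than state-of-the-art systems based on early (feature-level) or late (score-level) fusion.
Evaluated on the AVEC Sentiment Analysis in the Wild dataset.
Reached CCCs of 0.606 (arousal), 0.534 (valence), and 0.170 (liking) on the development partition.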
Abstract
This paper presents a novel deep neural network (DNN) for multimodal fusion of audio, video and text modalities for emotion recognition. The proposed DNN architecture has independent and shared layers, which aim to learn a representation for each modality as well as the combined representation that yields the best prediction. Experimental results on the AVEC Sentiment Analysis in the Wild dataset indicate that the proposed DNN can achieve a higher Concordance Correlation Coefficient (CCC) than other state-of-the-art systems that perform early fusion of modalities at the feature level (i.e., concatenation) or late fusion at the score level (i.e., weighted average). The proposed DNN achieved CCCs of 0.606, 0.534, and 0.170 on the development partition of the dataset for predicting arousal, valence and liking, respectively.
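For readers unfamiliar with the metric, the Concordance Correlation Coefficient between gold labels and predictions is commonly defined as twice their covariance divided by the sum of their variances plus the squared difference of their means. The sketch below implements that standard definition with population (biased) statistics; the function name `concordance_cc` is hypothetical, and the exact evaluation script used by the authors is not described here.

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Concordance Correlation Coefficient using population statistics."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

gold = [1.0, 2.0, 3.0]
print(concordance_cc(gold, [1.0, 2.0, 3.0]))  # identical series
print(concordance_cc(gold, [2.0, 3.0, 4.0]))  # same shape, shifted by a constant
```

Unlike Pearson correlation, the CCC penalizes a constant offset or a scale mismatch: the shifted series above is perfectly correlated with the gold labels but scores below 1.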
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition
