Multimodal Fusion with Deep Neural Networks for Audio-Video Emotion Recognition

Juan D. S. Ortega; Mohammed Senoussaoui; Eric Granger; Marco Pedersoli; Patrick Cardinal; Alessandro L. Koerich

arXiv:1907.03196·cs.CV·July 9, 2019·43 cites

Open Access

TL;DR

This paper introduces a deep neural network architecture for multimodal emotion recognition that fuses audio, video, and text data, and reports higher CCC than early-fusion and late-fusion baselines on the AVEC Sentiment Analysis in the Wild dataset.

Contribution

The paper proposes a new DNN architecture with independent and shared layers for multimodal fusion, improving emotion prediction accuracy over traditional fusion approaches.
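
As a rough illustration of the "independent and shared layers" idea, the sketch below runs one forward pass through a toy network: each modality first passes through its own dense layer, the resulting hidden vectors are joined, and shared layers map the joint vector to three outputs. The layer sizes, the ReLU activation, the random weights, and all function names are assumptions made for this example; the summary above does not specify the paper's actual configuration.

```python
import random

def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def init_dense(rng, n_in, n_out):
    """Return (weights, biases); weights[j] holds the n_in weights of output unit j."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer):
    """Affine map of input vector x through one dense layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def forward(sample, params):
    """Independent per-modality layers, then shared layers on the joined vector."""
    joint = []
    for modality, layer in params["independent"].items():
        joint.extend(relu(dense(sample[modality], layer)))
    shared = relu(dense(joint, params["shared"]))
    return dense(shared, params["output"])  # three linear outputs

rng = random.Random(0)
# Hypothetical feature sizes; real audio/video/text descriptors are much larger.
feature_dims = {"audio": 8, "video": 6, "text": 4}
params = {
    "independent": {m: init_dense(rng, d, 5) for m, d in feature_dims.items()},
    "shared": init_dense(rng, 5 * len(feature_dims), 7),
    "output": init_dense(rng, 7, 3),
}
sample = {m: [rng.gauss(0.0, 1.0) for _ in range(d)]
          for m, d in feature_dims.items()}
arousal, valence, liking = forward(sample, params)
```

The weights here are untrained, so the three outputs carry no meaning; the point is only the data flow from separate modality branches into a common representation.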

Findings

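01. Achieved a higher Concordance Correlation Coefficient (CCC) than state-of-the-art systems based on early (feature-level) and late (score-level) fusion.

02. Demonstrated these results on the AVEC Sentiment Analysis in the Wild dataset.

03. Reached CCCs of 0.606 (arousal), 0.534 (valence), and 0.170 (liking) on the development partition.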
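
The findings are reported as Concordance Correlation Coefficient (CCC) values. For readers unfamiliar with the metric, here is a minimal sketch of its standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population statistics. The function name and the toy inputs are chosen for this example and are not taken from the paper.

```python
from statistics import fmean

def concordance_correlation_coefficient(predictions, targets):
    """CCC between two equal-length sequences, using population moments."""
    if len(predictions) != len(targets) or len(predictions) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p = fmean(predictions)
    mean_t = fmean(targets)
    var_p = fmean([(p - mean_p) ** 2 for p in predictions])
    var_t = fmean([(t - mean_t) ** 2 for t in targets])
    cov = fmean([(p - mean_p) * (t - mean_t)
                 for p, t in zip(predictions, targets)])
    denominator = var_p + var_t + (mean_p - mean_t) ** 2
    if denominator == 0.0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2.0 * cov / denominator

perfect = concordance_correlation_coefficient([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = concordance_correlation_coefficient([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

In the second call the predictions are perfectly correlated with the targets but offset by one unit, and the CCC drops to 4/7. This penalty for bias and scale mismatch is what distinguishes CCC from the Pearson correlation.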

Abstract

This paper presents a novel deep neural network (DNN) for multimodal fusion of audio, video and text modalities for emotion recognition. The proposed DNN architecture has independent and shared layers which aim to learn the representation for each modality, as well as the best combined representation to achieve the best prediction. Experimental results on the AVEC Sentiment Analysis in the Wild dataset indicate that the proposed DNN can achieve a higher level of Concordance Correlation Coefficient (CCC) than other state-of-the-art systems that perform early fusion of modalities at feature-level (i.e., concatenation) and late fusion at score-level (i.e., weighted average). The proposed DNN achieves CCCs of 0.606, 0.534, and 0.170 on the development partition of the dataset for predicting arousal, valence and liking, respectively.
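
The two baseline strategies named in the abstract can be sketched in a few lines: early fusion concatenates per-modality feature vectors before a single model sees them, while late fusion combines per-modality predictions with a weighted average. The function names, feature values, and weights below are hypothetical and only illustrate the two operations; they are not the baselines' actual settings.

```python
def early_fusion(features_by_modality):
    """Feature-level fusion: concatenate per-modality feature vectors."""
    fused = []
    for features in features_by_modality:
        fused.extend(features)
    return fused

def late_fusion(scores, weights):
    """Score-level fusion: weighted average of per-modality predictions."""
    if len(scores) != len(weights):
        raise ValueError("one weight is required per score")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * s for w, s in zip(weights, scores)) / total

# Toy audio, video, and text feature vectors (made-up values).
fused_features = early_fusion([[0.1, 0.2], [0.3], [0.4, 0.5]])

# Toy per-modality arousal predictions, with the third modality weighted twice.
fused_score = late_fusion([0.2, 0.6, 0.4], [1.0, 1.0, 2.0])
```

Early fusion leaves all cross-modal modelling to the downstream predictor, whereas late fusion only mixes final scores; an architecture with independent and shared layers sits between these two extremes.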

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition