Multi-modal Residual Perceptron Network for Audio-Video Emotion Recognition

Xin Chang; Władysław Skarbek

arXiv:2107.10742 · eess.SP · August 2, 2021 · 1 cites

TL;DR

This paper introduces a Multi-modal Residual Perceptron Network that enhances audio-video emotion recognition by addressing noise and modality fusion issues, achieving state-of-the-art accuracy on benchmark datasets.

Contribution

The paper proposes a novel end-to-end multi-modal neural network architecture with time augmentation, improving emotion recognition accuracy over existing methods.

Findings

01

Achieved 91.4% accuracy on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

02

Achieved 83.15% accuracy on the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D)

03

Demonstrated potential for multi-modal applications beyond audio-visual data

Abstract

Audio-Video Emotion Recognition is now attacked with Deep Neural Network modeling tools. In published papers, as a rule, the authors show only cases where multi-modality is superior to the audio-only or video-only modality. However, cases where a single modality is superior can also be found. In our research, we hypothesize that for fuzzy categories of emotional events, the within-modal and inter-modal noisy information represented indirectly in the parameters of the modeling neural network impedes better performance in the existing late fusion and end-to-end multi-modal network training strategies. To take advantage of both solutions and overcome their deficiencies, we define a Multi-modal Residual Perceptron Network which performs end-to-end learning from multi-modal network branches, yielding a multi-modal feature representation that generalizes better. For the proposed Multi-modal Residual Perceptron…
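
The idea of fusing branch features through a residual perceptron can be illustrated with a minimal sketch. This is not the authors' architecture: the abstract shown here does not specify the layers, so the concatenation step, the two-layer perceptron, the feature sizes, and all names (`fuse`, `residual_perceptron`, `init_params`) are assumptions chosen only for illustration. It uses plain Python, with no training loop.

```python
import random

def linear(x, w, b):
    """Dense layer: one output per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_perceptron(x, w1, b1, w2, b2):
    """Two-layer perceptron with a skip connection:
    y = x + W2 * relu(W1 * x + b1) + b2."""
    hidden = relu(linear(x, w1, b1))
    delta = linear(hidden, w2, b2)
    return [xi + di for xi, di in zip(x, delta)]

def fuse(audio_feat, video_feat, params):
    """Concatenate the two branch features (assumed fusion step),
    then refine the joint vector with the residual block."""
    joint = audio_feat + video_feat  # list concatenation
    return residual_perceptron(joint, *params)

def init_params(dim, hidden, rng, scale=0.1):
    """Small random weights and zero biases (hypothetical initialization)."""
    w1 = [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[rng.uniform(-scale, scale) for _ in range(hidden)] for _ in range(dim)]
    b2 = [0.0] * dim
    return w1, b1, w2, b2

# Toy feature vectors standing in for the outputs of an audio and a video branch.
rng = random.Random(0)
audio = [0.2, -0.1, 0.5]
video = [0.7, 0.0, -0.3, 0.1]
params = init_params(dim=len(audio) + len(video), hidden=4, rng=rng)
fused = fuse(audio, video, params)
print(len(fused))
```

Because of the skip connection, the block returns the concatenated branch features unchanged when its weights are zero, so the perceptron only has to learn a correction to the uni-modal representations rather than a new representation from scratch.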

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Emotion and Mood Recognition