Multi-modal Residual Perceptron Network for Audio-Video Emotion Recognition
Xin Chang, Władysław Skarbek

TL;DR
This paper introduces a Multi-modal Residual Perceptron Network that enhances audio-video emotion recognition by addressing noise and modality fusion issues, achieving state-of-the-art accuracy on benchmark datasets.
Contribution
The paper proposes a novel end-to-end multi-modal neural network architecture with time augmentation, improving emotion recognition accuracy over existing methods.
Findings
Achieved 91.4% accuracy on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
Achieved 83.15% accuracy on the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D)
Demonstrated potential for multi-modal applications beyond audio-visual data
Abstract
Audio-Video Emotion Recognition is now attacked with Deep Neural Network modeling tools. In published papers, as a rule, the authors show only cases where multi-modality is superior to the audio-only or video-only modality. However, cases where a single modality is superior can also be found. In our research, we hypothesize that for fuzzy categories of emotional events, the within-modal and inter-modal noisy information represented indirectly in the parameters of the modeling neural network impedes better performance in the existing late fusion and end-to-end multi-modal network training strategies. To take advantage of both solutions and overcome their deficiencies, we define a Multi-modal Residual Perceptron Network which performs end-to-end learning from multi-modal network branches, producing a multi-modal feature representation that generalizes better. For the proposed Multi-modal Residual Perceptron…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Emotion and Mood Recognition
