Framewise approach in multimodal emotion recognition in OMG challenge
Grigoriy Sterling, Andrey Belyaev, Maxim Ryabov

TL;DR
This paper presents a multimodal emotion recognition approach using ensemble models on voice and face data, achieving over 53% accuracy in the OMG challenge by combining features extracted with neural networks and decision-level fusion.
Contribution
The study introduces a framewise multimodal emotion recognition method employing ensemble neural networks trained on separate voice and face streams, with decision-level fusion improving accuracy.
Findings
Achieved 53% unweighted accuracy on 7 emotions.
Reached mean squared errors of 0.05 and 0.09 for arousal and valence.
Decision-level fusion of the two modalities improved on the single-modality results.
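The two metrics named above can be sketched in a few lines. This is a minimal illustration, not the challenge's official scoring code. It assumes that "unweighted accuracy" means the mean of per-class recalls, a common convention in emotion recognition that the source does not spell out. The function names are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (ASSUMED definition, see lead-in).

    Every class present in y_true counts equally, no matter how many
    samples it has.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    recalls = []
    for label in sorted(set(y_true)):
        # Predictions for the samples whose true class is `label`.
        preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
        hits = sum(1 for p in preds_for_label if p == label)
        recalls.append(hits / len(preds_for_label))
    return sum(recalls) / len(recalls)


# Toy data, invented for illustration only.
arousal_mse = mean_squared_error([0.0, 1.0], [0.0, 0.5])
toy_ua = unweighted_accuracy([0, 0, 0, 1], [0, 0, 1, 1])
```

With this definition a rare emotion class weighs as much as a frequent one, which is why the figure can differ from plain accuracy on imbalanced data.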
Abstract
In this report we describe our approach, which achieves 53% unweighted accuracy over 7 emotions and mean squared errors of 0.05 and 0.09 for arousal and valence in the OMG emotion recognition challenge. Our results were obtained with an ensemble of single-modality models trained separately on voice and face data from video. We treat each stream as a sequence of frames, estimate features from each frame, and process the resulting sequence with a recurrent neural network. By an audio frame we mean a short spectrogram interval. To estimate features from face images we used our own ResNet neural network pretrained on the AffectNet database. Each short spectrogram was treated as an image and likewise processed by a convolutional network. As the base audio model we used a ResNet pretrained on a speaker recognition task. Predictions from both modalities were fused at the decision level and improve single-channel…
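Decision-level fusion, as mentioned in the abstract, combines the outputs of the per-modality models rather than their features. The sketch below shows one simple way to do this: a weighted average of per-class probabilities from an audio model and a face model. The abstract does not state the fusion rule the authors used, so the averaging scheme, the `audio_weight` parameter, and the function names are assumptions made for illustration.

```python
def fuse_decisions(audio_probs, face_probs, audio_weight=0.5):
    """Weighted average of two per-class probability vectors.

    ASSUMPTION: a linear weighted average; the source only says that
    predictions were fused at the decision level.
    """
    if len(audio_probs) != len(face_probs):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    face_weight = 1.0 - audio_weight
    return [audio_weight * a + face_weight * f
            for a, f in zip(audio_probs, face_probs)]


def predict_label(probs):
    """Index of the highest-scoring class."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Toy probabilities over three classes, invented for illustration.
audio = [0.1, 0.6, 0.3]   # the audio model prefers class 1
face = [0.2, 0.2, 0.6]    # the face model prefers class 2
fused = fuse_decisions(audio, face)
fused_label = predict_label(fused)
```

Because fusion happens on the outputs, each single-modality model can be trained and tuned independently, and the weight can be chosen afterwards on validation data.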
Taxonomy
Topics: Emotion and Mood Recognition · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection
