Framewise approach in multimodal emotion recognition in OMG challenge

Grigoriy Sterling; Andrey Belyaev; Maxim Ryabov

arXiv:1805.01369·cs.AI·May 4, 2018

TL;DR

This paper presents a multimodal emotion recognition approach for the OMG challenge that reaches 53% unweighted accuracy over 7 emotions by ensembling single-modality voice and face models, extracting framewise features with neural networks and fusing predictions at the decision level.

Contribution

The study introduces a framewise multimodal emotion recognition method employing ensemble neural networks trained on separate voice and face streams, with decision-level fusion improving accuracy.
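Decision-level fusion of the two streams can be illustrated with a minimal sketch. The source does not say how the predictions were combined, so the weighted average of per-class probabilities used here, the function name `fuse_decisions`, the weight `w_voice`, and the example probabilities are all assumptions made for illustration only.

```python
def fuse_decisions(p_voice, p_face, w_voice=0.5):
    """Combine per-class probabilities from two single-modality models.

    Assumption: fusion is a weighted average of class probabilities.
    The paper only states that fusion happens at the decision level.
    """
    if len(p_voice) != len(p_face):
        raise ValueError("both models must score the same set of classes")
    fused = [w_voice * v + (1.0 - w_voice) * f for v, f in zip(p_voice, p_face)]
    total = sum(fused)
    # Renormalize so the fused scores still sum to 1.
    return [x / total for x in fused]


# Hypothetical probabilities over 7 emotion classes from each model.
p_voice = [0.10, 0.05, 0.05, 0.40, 0.20, 0.10, 0.10]
p_face = [0.05, 0.05, 0.10, 0.30, 0.35, 0.05, 0.10]

fused = fuse_decisions(p_voice, p_face)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

With equal weights, the fused prediction follows whichever class has the highest average score across the two models, so one modality can correct the other when they disagree.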

Findings

01 Achieved 53% unweighted accuracy over 7 emotions.

02 Achieved mean squared errors of 0.05 and 0.09 for arousal and valence.

03 Decision-level fusion improved on single-modality results.

Abstract

In this report we describe our approach, which achieves $53\%$ unweighted accuracy over $7$ emotions and mean squared errors of $0.05$ and $0.09$ for arousal and valence in the OMG emotion recognition challenge. Our results were obtained with an ensemble of single-modality models trained separately on voice and face data from video. We treat each stream as a sequence of frames, estimate features from each frame, and process the resulting sequence with a recurrent neural network. An audio frame is a short $0.4$-second spectrogram interval. To estimate features from face images we used our own ResNet neural network pretrained on the AffectNet database. Each short spectrogram was treated as an image and also processed by a convolutional network. As the base audio model we used a ResNet pretrained on a speaker recognition task. Predictions from both modalities were fused at the decision level and improve single-channel…
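The framewise view of the audio stream can be sketched as follows. This is only an illustration of cutting a signal into 0.4-second intervals: the abstract frames the spectrogram, whereas this sketch slices the raw waveform for simplicity, and the sample rate of 16000 Hz, the non-overlapping hop, and the name `split_into_frames` are assumptions not taken from the paper.

```python
def split_into_frames(samples, sample_rate, frame_seconds=0.4, hop_seconds=0.4):
    """Cut a 1-D signal into fixed-length frames.

    Assumptions: frames do not overlap (hop equals frame length) and a
    trailing remainder shorter than one frame is dropped.
    """
    frame_len = int(round(frame_seconds * sample_rate))
    hop_len = int(round(hop_seconds * sample_rate))
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop durations must be positive")
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


sample_rate = 16000                   # assumed sample rate in Hz
signal = [0.0] * (2 * sample_rate)    # 2 seconds of placeholder audio
frames = split_into_frames(signal, sample_rate)
```

Each element of `frames` would then be turned into a spectrogram image, passed through the convolutional feature extractor, and the resulting feature sequence fed to the recurrent network.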

Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection