Towards Universal End-to-End Affect Recognition from Multilingual Speech by ConvNets
Dario Bertero, Onno Kampman, Pascale Fung

TL;DR
This paper introduces a universal end-to-end CNN model for affect recognition from multilingual speech, leveraging raw waveforms to improve emotion and personality detection across languages.
Contribution
It presents the first universal CNN-based affect recognition model trained on multiple languages simultaneously, outperforming single-language models and spectrogram-based CNNs.
Findings
12.8% average improvement in emotion recognition over the same model trained on each language separately
10.1% average improvement in personality recognition over the same single-language baseline
Network learns language-independent features like pitch and energy
Abstract
We propose an end-to-end affect recognition approach using a Convolutional Neural Network (CNN) that handles multiple languages, with applications to emotion and personality recognition from speech. We lay the foundation of a universal model that is trained on multiple languages at once. As affect is shared across all languages, we are able to leverage shared information between languages and improve the overall performance for each one. We obtained an average improvement of 12.8% on emotion and 10.1% on personality when compared with the same model trained on each language only. It is end-to-end because we directly take narrow-band raw waveforms as input. This allows us to accept as input audio recorded from any source and to avoid the overhead and information loss of feature extraction. It outperforms a similar CNN using spectrograms as input by 12.8% for emotion and 6.3% for…
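To make the "end-to-end from raw waveforms" idea concrete, here is a minimal NumPy sketch of a forward pass: strided 1-D convolutions applied directly to audio samples, global max-pooling over time so clips of any length map to a fixed-size vector, and a softmax output layer. This is an illustration only, not the authors' architecture. The 8 kHz sample rate, filter widths, strides, channel counts, number of classes, and the names `conv1d`, `init_params`, and `forward` are all assumptions chosen for the example, and the weights are random rather than trained.

```python
import numpy as np


def conv1d(x, filters, stride):
    """Valid 1-D convolution. x: (time, in_ch); filters: (width, in_ch, out_ch)."""
    width = filters.shape[0]
    n_frames = (x.shape[0] - width) // stride + 1
    # Stack the overlapping windows, then contract over (width, in_ch).
    windows = np.stack([x[i * stride:i * stride + width] for i in range(n_frames)])
    return np.tensordot(windows, filters, axes=([1, 2], [0, 1]))


def init_params(rng, n_classes, channels=(16, 32), width=40, stride=10):
    """Random weights for a hypothetical two-layer model (all sizes are assumptions)."""
    conv, in_ch = [], 1
    for out_ch in channels:
        conv.append(rng.normal(0.0, 0.05, size=(width, in_ch, out_ch)))
        in_ch = out_ch
    return {
        "conv": conv,
        "stride": stride,
        "out_w": rng.normal(0.0, 0.05, size=(in_ch, n_classes)),
        "out_b": np.zeros(n_classes),
    }


def forward(waveform, params):
    """Map a raw waveform of shape (samples,) to class probabilities."""
    h = waveform[:, None]  # add a channel axis: (time, 1)
    for filters in params["conv"]:
        h = np.maximum(conv1d(h, filters, params["stride"]), 0.0)  # conv + ReLU
    pooled = h.max(axis=0)  # global max over time: (channels,)
    logits = pooled @ params["out_w"] + params["out_b"]
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
params = init_params(rng, n_classes=4)  # 4 classes is an arbitrary choice
one_second = rng.normal(size=8000)      # assumed 8 kHz narrow-band audio
probs = forward(one_second, params)
```

Because pooling collapses the time axis, the same function accepts clips of different durations. Under this sketch, training on several languages at once would simply mean drawing batches from the pooled multilingual data while sharing all weights.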
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis
