Towards Multimodal Emotion Recognition in German Speech Events in Cars using Transfer Learning
Deniz Cevher, Sebastian Zepf, Roman Klinger

TL;DR
This paper explores multimodal emotion recognition for German speech interactions in cars using transfer learning. It reports that combining audio, facial, and textual signals improves accuracy, but that the face and audio tools need refinement for in-car contexts.
Contribution
It introduces a transfer learning approach for emotion recognition from text in in-car settings and compares it with existing audio and face analysis tools.
Findings
Transfer learning improves emotion recognition by up to 10 percentage points in F1 score.
Models reach a micro-averaged F1 score of up to 76 (on a 0 to 100 scale) across emotions.
Off-the-shelf face and audio tools are not yet suitable for in-car emotion detection.
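The findings above are stated in micro-average F1. As a minimal sketch of how that metric is computed: true positives, false positives, and false negatives are pooled over all classes before precision and recall are taken. The function name `micro_f1`, the label names, and the toy gold/predicted lists are hypothetical illustrations, not data or code from the paper.

```python
def micro_f1(gold, pred, labels):
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1 once."""
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1  # predicted this label and it was correct
            elif p == label and g != label:
                fp += 1  # predicted this label but gold was another one
            elif g == label and p != label:
                fn += 1  # gold was this label but it was missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up labels and predictions (illustrative only).
LABELS = ["joy", "annoyance", "insecurity"]
gold = ["joy", "annoyance", "insecurity", "joy"]
pred = ["joy", "joy", "insecurity", "annoyance"]

score = micro_f1(gold, pred, LABELS)
print(f"micro-F1: {100 * score:.0f}")
```

When every utterance carries exactly one label, the pooled counts make micro-average F1 equal to plain accuracy. A "10 percentage point" gain in this setting means, for example, a move from 66 to 76 on the 0 to 100 scale, not a 10% relative change.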
Abstract
The recognition of emotions by humans is a complex process which considers multiple interacting signals such as facial expressions and both prosody and semantic content of utterances. Commonly, research on automatic recognition of emotions is, with few exceptions, limited to one modality. We describe an in-car experiment for emotion recognition from speech interactions for three modalities: the audio signal of a spoken interaction, the visual signal of the driver's face, and the manually transcribed content of utterances of the driver. We use off-the-shelf tools for emotion detection in audio and face and compare that to a neural transfer learning approach for emotion recognition from text which utilizes existing resources from other domains. We see that transfer learning enables models based on out-of-domain corpora to perform well. This method contributes up to 10 percentage points in…
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Speech and Audio Processing
