Multimodal Speech Emotion Recognition and Ambiguity Resolution
Gaurav Sahu

TL;DR
This paper compares traditional machine learning and deep learning models for speech emotion recognition using hand-crafted audio features. It also explores ambiguity resolution by incorporating text features, and reports performance comparable to state-of-the-art methods.
Contribution
It demonstrates that simple machine learning models with hand-crafted features can match deep learning performance in speech emotion recognition.
Findings
Traditional models achieve comparable accuracy to deep learning methods.
Inclusion of text features helps resolve communication ambiguity.
Lightweight models are effective for emotion recognition tasks.
Abstract
Identifying emotion from speech is a non-trivial task pertaining to the ambiguous definition of emotion itself. In this work, we adopt a feature-engineering based approach to tackle the task of speech emotion recognition. Formalizing our problem as a multi-class classification problem, we compare the performance of two categories of models. For both, we extract eight hand-crafted features from the audio signal. In the first approach, the extracted features are used to train six traditional machine learning classifiers, whereas the second approach is based on deep learning wherein a baseline feed-forward neural network and an LSTM-based classifier are trained over the same features. In order to resolve ambiguity in communication, we also include features from the text domain. We report accuracy, f-score, precision, and recall for the different experiment settings we evaluated our models…
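The feature-engineering pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not name the eight features or the six classifiers, so everything below is an assumption chosen for illustration: two common hand-crafted audio features (zero-crossing rate and RMS energy), a nearest-centroid classifier standing in for a traditional model, synthetic sine-wave "clips" in place of real speech, and two hypothetical labels ("calm" and "excited"). It ends by computing the four metrics the abstract mentions, macro-averaged over classes.

```python
import math
import random


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(signal) - 1)


def rms_energy(signal):
    """Root-mean-square amplitude of the clip."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def extract_features(signal):
    # Illustrative feature vector; not the paper's eight features.
    return [zero_crossing_rate(signal), rms_energy(signal)]


def synth_clip(freq_hz, amplitude, rng, sample_rate=8000, n_samples=800):
    """Hypothetical stand-in for a speech clip: a sine wave plus light noise."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate) + rng.gauss(0, 0.01)
        for t in range(n_samples)
    ]


def make_dataset(n_per_class, rng):
    """Build a balanced two-class toy dataset of feature vectors and labels."""
    X, y = [], []
    for _ in range(n_per_class):
        calm = synth_clip(rng.uniform(80, 120), rng.uniform(0.15, 0.25), rng)
        excited = synth_clip(rng.uniform(350, 450), rng.uniform(0.7, 0.9), rng)
        X.extend([extract_features(calm), extract_features(excited)])
        y.extend(["calm", "excited"])
    return X, y


def fit_centroids(X, y):
    """Nearest-centroid 'training': average the feature vectors per class."""
    grouped = {}
    for feats, label in zip(X, y):
        grouped.setdefault(label, []).append(feats)
    return {
        label: [sum(col) / len(col) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, feats):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, feats))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F-score."""
    labels = sorted(set(y_true))
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    n = len(labels)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


rng = random.Random(0)  # fixed seed so the toy run is repeatable
X_train, y_train = make_dataset(20, rng)
X_test, y_test = make_dataset(10, rng)
centroids = fit_centroids(X_train, y_train)
y_pred = [predict(centroids, feats) for feats in X_test]
accuracy, precision, recall, f_score = macro_scores(y_test, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f-score={f_score:.2f}")
```

The two synthetic classes are deliberately easy to separate, so the scores say nothing about real speech. On an actual corpus the feature extractor and classifier would be replaced, and text-derived features could be appended to each audio feature vector before training.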
Taxonomy
Topics: Music and Audio Processing · Emotion and Mood Recognition · Speech Recognition and Synthesis
