Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition
Amirali Soltani Tehrani, Niloufar Faridani, Ramin Toosi

TL;DR
This paper demonstrates that combining self-supervised feature extraction with supervised CNN classification significantly improves speech emotion recognition accuracy, especially with small audio segments, surpassing traditional methods.
Contribution
It introduces a novel approach integrating Wav2Vec-based self-supervised features with CNNs for SER, outperforming baseline and transfer learning methods.
Findings
Outperforms baseline SVM and transfer learning CNN methods
Self-supervised features enhance emotion recognition accuracy
Outperforms state-of-the-art SER methods
Abstract
Speech Emotion Recognition (SER) plays a pivotal role in enhancing human-computer interaction by enabling a deeper understanding of emotional states across a wide range of applications, contributing to more empathetic and effective communication. This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments. In the preprocessing step, to eliminate the need for hand-crafted audio features, we employ a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data. The output feature maps of the preprocessing step are then fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification. Using the ShEMO dataset as our testing ground, the proposed method surpasses two baseline methods, i.e.…
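Because the abstract stresses small audio segments, it can help to see how few feature frames a short clip produces before the CNN sees it. The sketch below is only an illustration under stated assumptions: it uses the convolutional encoder layout of wav2vec 2.0 (kernel and stride per layer) and a 16 kHz sample rate, neither of which is confirmed by this summary, and the helper name `num_frames` is hypothetical.

```python
# Assumed wav2vec 2.0 style convolutional encoder: (kernel, stride) per layer.
# This layout is an assumption; the summary does not state which Wav2Vec
# variant or configuration the authors used.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

SAMPLE_RATE = 16_000  # Hz, assumed


def num_frames(num_samples, layers=CONV_LAYERS):
    """Return how many feature frames an unpadded conv stack emits."""
    length = num_samples
    for kernel, stride in layers:
        if length < kernel:
            return 0  # segment too short to fill this layer's kernel
        length = (length - kernel) // stride + 1
    return length


# Hypothetical segment lengths, chosen only for illustration.
for seconds in (0.5, 1.0, 2.0):
    samples = int(seconds * SAMPLE_RATE)
    print(f"{seconds:.1f} s -> {num_frames(samples)} frames")
```

Under these assumptions a one-second segment yields 49 frames, so the feature map passed to the classifier is short along the time axis, which is one reason a compact CNN is a plausible fit for this setting.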
Taxonomy
Topics: Speech and Audio Processing · Emotion and Mood Recognition · Music and Audio Processing
