Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition

Amirali Soltani Tehrani; Niloufar Faridani; Ramin Toosi

arXiv:2309.12714·eess.AS·September 25, 2023·1 cites

Open Access

TL;DR

This paper demonstrates that combining self-supervised feature extraction with supervised CNN classification significantly improves speech emotion recognition accuracy, especially with small audio segments, surpassing traditional methods.

Contribution

It introduces a novel approach integrating Wav2Vec-based self-supervised features with CNNs for SER, outperforming baseline and transfer learning methods.

Findings

01 Outperforms baseline SVM and transfer learning CNN methods

02 Self-supervised features enhance emotion recognition accuracy

03 Superiority over state-of-the-art SER methods

Abstract

Speech Emotion Recognition (SER) plays a pivotal role in enhancing human-computer interaction by enabling a deeper understanding of emotional states across a wide range of applications, contributing to more empathetic and effective communication. This study proposes an approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments. In the preprocessing step, to eliminate the need for hand-crafted audio features, we employ a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data. The output feature maps of the preprocessing step are then fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification. Utilizing the ShEMO dataset as our testing ground, the proposed method surpasses two baseline methods, i.e.…
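The two-stage pipeline described in the abstract (split audio into small segments, extract a feature map per segment, classify the feature map with a CNN) can be illustrated with a minimal pure-Python sketch. Everything below is an assumption made for illustration and not the authors' implementation: `extract_features` is a toy stand-in for the Wav2Vec-based extractor, the layer sizes and segment length are arbitrary, the weights are random and untrained, and the `EMOTIONS` list is a hypothetical label set rather than the one used in the paper.

```python
import math
import random

# Hypothetical label set, for illustration only.
EMOTIONS = ["anger", "happiness", "neutral", "sadness"]


def split_into_segments(samples, segment_len):
    """Cut a waveform into non-overlapping segments, dropping the short tail."""
    return [samples[i:i + segment_len]
            for i in range(0, len(samples) - segment_len + 1, segment_len)]


def extract_features(segment, frame_len=4):
    """Toy placeholder for the self-supervised extractor (NOT Wav2Vec).

    Returns one [mean, energy] vector per frame, i.e. a (frames x 2) feature map.
    """
    frames = [segment[i:i + frame_len]
              for i in range(0, len(segment) - frame_len + 1, frame_len)]
    return [[sum(f) / frame_len, sum(x * x for x in f) / frame_len]
            for f in frames]


def conv1d_relu(feature_map, kernels):
    """Valid 1-D convolution over time followed by ReLU.

    kernels[k][t][d] is the weight of filter k at time offset t, feature d.
    """
    width = len(kernels[0])
    out = []
    for start in range(len(feature_map) - width + 1):
        window = feature_map[start:start + width]
        out.append([
            max(0.0, sum(w * x
                         for w_row, x_row in zip(kernel, window)
                         for w, x in zip(w_row, x_row)))
            for kernel in kernels
        ])
    return out


def classify(feature_map, kernels, dense):
    """Conv layer, global max pool over time, linear layer, softmax."""
    activations = conv1d_relu(feature_map, kernels)
    pooled = [max(column) for column in zip(*activations)]
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in dense]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Untrained random weights: 5 filters of width 3 over 2 features, then a dense layer.
kernels = [[[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
           for _ in range(5)]
dense = [[rng.uniform(-1, 1) for _ in range(5)] for _ in EMOTIONS]

waveform = [math.sin(0.05 * n) for n in range(100)]  # synthetic audio
segments = split_into_segments(waveform, segment_len=32)
probabilities = [classify(extract_features(s), kernels, dense) for s in segments]
```

Each entry of `probabilities` is a distribution over the hypothetical labels for one segment. In a real system the placeholder extractor would be replaced by a pretrained self-supervised model and the classifier weights would be learned from labeled data.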

Taxonomy

Topics: Speech and Audio Processing · Emotion and Mood Recognition · Music and Audio Processing