Improving Speech Emotion Recognition with Unsupervised Speaking Style Transfer

Leyuan Qu; Wei Wang; Cornelius Weber; Pengcheng Yue; Taihao Li; Stefan Wermter

arXiv:2211.08843·cs.SD·December 29, 2023

Open Access

TL;DR

This paper introduces EmoAug, an unsupervised style transfer model that enhances speech emotion recognition by augmenting data with varied prosodic styles, improving accuracy and addressing data imbalance.

Contribution

EmoAug is a novel unsupervised style transfer model that enriches emotional speech data with diverse prosodic attributes to improve speech emotion recognition performance.

Findings

01. EmoAug successfully transfers speaking styles while preserving speaker identity and content.

02. Data augmentation with EmoAug improves SER accuracy beyond state-of-the-art methods.

03. The augmented model mitigates overfitting caused by data imbalance.

Abstract

Humans can effortlessly modify various prosodic attributes, such as the placement of stress and the intensity of sentiment, to convey a specific emotion while maintaining consistent linguistic content. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enhance emotional expression and tackle the data scarcity issue in speech emotion recognition tasks. EmoAug consists of a semantic encoder and a paralinguistic encoder that represent verbal and non-verbal information respectively. Additionally, a decoder reconstructs speech signals by conditioning on the aforementioned two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches expressions of emotional speech with different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. EmoAug enables us to…
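
The data flow described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: the fixed random linear maps, the sizes `FEAT` and `HID`, the mean-over-time style vector, and every function name below are assumptions made only for illustration, and training by unsupervised reconstruction is omitted. The sketch shows only that a decoder conditioned on content features from one utterance and a style vector from another yields output aligned in time with the first.

```python
import random

FEAT, HID = 4, 3  # hypothetical frame-feature and hidden sizes


def make_matrix(rows, cols, rng):
    """Fixed random weights standing in for a trained network."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


rng = random.Random(0)
W_SEM = make_matrix(HID, FEAT, rng)      # toy semantic encoder weights
W_PARA = make_matrix(HID, FEAT, rng)     # toy paralinguistic encoder weights
W_DEC = make_matrix(FEAT, 2 * HID, rng)  # toy decoder weights


def semantic_encoder(frames):
    """Verbal information: one content vector per frame (time axis kept)."""
    return [matvec(W_SEM, frame) for frame in frames]


def paralinguistic_encoder(frames):
    """Non-verbal information: one style vector per utterance.

    Averaging over time is an assumption of this sketch.
    """
    projected = [matvec(W_PARA, frame) for frame in frames]
    return [sum(column) / len(projected) for column in zip(*projected)]


def decoder(content, style):
    """Rebuild frames from each content vector concatenated with the style."""
    return [matvec(W_DEC, c + style) for c in content]


def reconstruct(frames):
    """Training-time path: content and style come from the same utterance."""
    return decoder(semantic_encoder(frames), paralinguistic_encoder(frames))


def augment(source, style_reference):
    """Augmentation path: content from `source`, style from another utterance."""
    return decoder(semantic_encoder(source),
                   paralinguistic_encoder(style_reference))


def random_utterance(num_frames, rng):
    """Random frames standing in for acoustic features of one utterance."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(FEAT)]
            for _ in range(num_frames)]


source = random_utterance(5, rng)      # utterance whose content is kept
reference = random_utterance(8, rng)   # utterance that supplies the style
augmented = augment(source, reference)
```

Here `augmented` has as many frames as `source`, regardless of the length of `reference`, because only the utterance-level style vector is taken from the reference. In a data augmentation setting, each such output would presumably keep the emotion label of its source utterance; that labelling choice is likewise an assumption of this sketch.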

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings