Improving Speech Emotion Recognition with Unsupervised Speaking Style Transfer
Leyuan Qu, Wei Wang, Cornelius Weber, Pengcheng Yue, Taihao Li, and Stefan Wermter

TL;DR
This paper introduces EmoAug, an unsupervised style transfer model that enhances speech emotion recognition by augmenting data with varied prosodic styles, improving accuracy and addressing data imbalance.
Contribution
EmoAug is a novel unsupervised style transfer model that enriches emotional speech data with diverse prosodic attributes to improve speech emotion recognition performance.
Findings
EmoAug successfully transfers speaking styles while preserving speaker identity and content.
Data augmentation with EmoAug improves speech emotion recognition (SER) accuracy beyond state-of-the-art methods.
Training on the augmented data mitigates overfitting caused by data imbalance.
Abstract
Humans can effortlessly modify various prosodic attributes, such as the placement of stress and the intensity of sentiment, to convey a specific emotion while maintaining consistent linguistic content. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enhance emotional expression and tackle the data scarcity issue in speech emotion recognition tasks. EmoAug consists of a semantic encoder and a paralinguistic encoder that represent verbal and non-verbal information respectively. Additionally, a decoder reconstructs speech signals by conditioning on the aforementioned two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches expressions of emotional speech with different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. EmoAug enables us to…
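The augmentation idea described in the abstract (keep the content of one utterance, take the speaking style from another) can be illustrated with a small planning sketch. This is not the authors' implementation: the function name `plan_style_transfer_pairs`, the `target_per_class` parameter, and the choice to draw the style reference from the same emotion class (so the original label can be reused) are all assumptions made for illustration. The style transfer model itself is not shown; each planned triple would be handed to it as (content utterance, style utterance, label).

```python
import random
from collections import defaultdict
from typing import List, Tuple

Utterance = Tuple[str, str]  # (utterance id, emotion label)


def plan_style_transfer_pairs(
    data: List[Utterance], target_per_class: int, seed: int = 0
) -> List[Tuple[str, str, str]]:
    """Plan (content_id, style_id, label) triples that top up small classes.

    Assumption (not from the paper): the style reference comes from the
    same emotion class as the content utterance, so the label is reused.
    """
    rng = random.Random(seed)

    # Group utterance ids by emotion label.
    by_label = defaultdict(list)
    for utt_id, label in data:
        by_label[label].append(utt_id)

    pairs = []
    for label, ids in sorted(by_label.items()):
        missing = max(0, target_per_class - len(ids))
        if len(ids) < 2:
            continue  # no distinct style reference available in this class
        for _ in range(missing):
            # Two distinct utterances: one supplies content, one supplies style.
            content_id, style_id = rng.sample(ids, 2)
            pairs.append((content_id, style_id, label))
    return pairs


# Toy, made-up dataset with an imbalanced label distribution.
toy_data = [
    ("a1", "angry"), ("a2", "angry"), ("a3", "angry"),
    ("h1", "happy"), ("h2", "happy"),
    ("s1", "sad"),
]
toy_pairs = plan_style_transfer_pairs(toy_data, target_per_class=3)
```

In this toy setup only the "happy" class receives a planned pair: "angry" already meets the target, and "sad" has a single utterance, so no distinct style reference exists for it under the same-class assumption.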
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
