StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis
Rui Liu, Berrak Sisman, Haizhou Li

TL;DR
StrengthNet is a deep learning model designed to accurately assess emotion strength in speech, improving generalization across different datasets for more realistic emotional speech synthesis.
Contribution
The paper introduces StrengthNet, a multi-task learning framework with data augmentation, enhancing emotion strength prediction accuracy and generalization in speech synthesis.
Findings
High correlation between predicted and ground truth emotion strength.
Effective generalization to unseen speech data.
Improved emotion strength assessment accuracy.
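The correlation finding above can be made concrete with a small sketch. The summary does not say which correlation measure the authors report, so the Pearson coefficient used here is an assumption, and the function name `pearson_correlation` and the example scores are hypothetical, chosen only for illustration.

```python
import math


def pearson_correlation(predicted, reference):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Covariance and variances, without the shared 1/n factor (it cancels out).
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)


# Made-up strength scores in [0, 1]; these are not results from the paper.
predicted = [0.10, 0.35, 0.50, 0.80, 0.95]
reference = [0.05, 0.40, 0.55, 0.75, 0.90]
print(round(pearson_correlation(predicted, reference), 3))
```

A value near 1 would indicate that predicted strengths rise and fall with the reference scores; a value near 0 would indicate no linear relationship.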
Abstract
Recently, emotional speech synthesis has achieved remarkable performance. The emotion strength of synthesized speech can be controlled flexibly using a strength descriptor, which is obtained by an emotion attribute ranking function. However, a ranking function trained on specific data generalizes poorly, which limits its applicability to more realistic cases. In this paper, we propose a deep learning based emotion strength assessment network for strength prediction, referred to as StrengthNet. Our model follows a multi-task learning framework with a structure that includes an acoustic encoder, a strength predictor and an auxiliary emotion predictor. A data augmentation strategy is used to improve model generalization. Experiments show that the emotion strength predicted by the proposed StrengthNet is highly correlated with ground truth scores for seen and…
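The multi-task setup described in the abstract (a strength predictor trained alongside an auxiliary emotion predictor) can be sketched as a combined training objective. The abstract does not specify the loss functions or how they are weighted, so everything below is an assumption: mean absolute error for the strength term, cross-entropy for the emotion term, and a hypothetical `aux_weight` parameter balancing the two. It is a minimal illustration, not the authors' implementation.

```python
import math


def mean_absolute_error(predicted, target):
    """Average absolute difference between predicted and target strengths."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def cross_entropy(logits, label):
    """Cross-entropy of one example, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def multitask_loss(strength_pred, strength_target,
                   emotion_logits, emotion_labels, aux_weight=1.0):
    """Strength regression loss plus a weighted auxiliary emotion loss.

    aux_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    strength_loss = mean_absolute_error(strength_pred, strength_target)
    emotion_loss = sum(
        cross_entropy(logits, label)
        for logits, label in zip(emotion_logits, emotion_labels)
    ) / len(emotion_labels)
    return strength_loss + aux_weight * emotion_loss


# Two made-up utterances: predicted vs. target strength, and emotion logits
# over three hypothetical classes with their integer labels.
loss = multitask_loss(
    strength_pred=[0.7, 0.2],
    strength_target=[0.8, 0.1],
    emotion_logits=[[2.0, 0.1, -1.0], [0.0, 1.5, 0.3]],
    emotion_labels=[0, 1],
    aux_weight=0.5,
)
print(loss)
```

In a sketch like this, both predictors would share the acoustic encoder's output, so the auxiliary emotion term shapes the shared representation even though only the strength prediction is used at inference time.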
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Speech and Audio Processing
