Speech Synthesis with Mixed Emotions
Kun Zhou, Berrak Sisman, Rajib Rana, B. W. Schuller, Haizhou Li

TL;DR
This paper introduces a novel speech synthesis framework that can generate speech with mixed emotions by measuring and controlling emotional differences, enabling more nuanced and realistic emotional speech synthesis.
Contribution
It presents the first method for modeling, synthesizing, and evaluating mixed emotions in speech using a sequence-to-sequence framework with a novel emotion difference formulation.
Findings
Effective control of mixed emotions in synthesized speech
Validated through objective and subjective evaluations
First study to model and synthesize mixed emotions
Abstract
Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During training, the framework not only explicitly characterizes emotion styles but also explores the ordinal nature of emotions by quantifying the differences with other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. The objective and subjective evaluations have validated the effectiveness of the…
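The run-time control described in the abstract, where a manually defined emotion attribute vector selects the desired mixture, can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the emotion set, the [0, 1] intensity range, the names `EMOTIONS`, `attribute_vector`, and `mix_embeddings`, and the weighted-sum combination are all hypothetical and are not taken from the paper, whose actual formulation is not given in this summary.

```python
from typing import Dict, List

# Hypothetical emotion inventory; the paper's actual set is not stated here.
EMOTIONS: List[str] = ["happy", "sad", "angry", "surprise"]


def attribute_vector(mix: Dict[str, float]) -> List[float]:
    """Build an emotion attribute vector from {emotion: intensity}.

    Intensities are assumed to lie in [0, 1]; unspecified emotions get 0.
    """
    vec = [0.0] * len(EMOTIONS)
    for name, value in mix.items():
        if name not in EMOTIONS:
            raise KeyError(f"unknown emotion: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"intensity for {name} must be in [0, 1]")
        vec[EMOTIONS.index(name)] = float(value)
    return vec


def mix_embeddings(attr: List[float], embeddings: List[List[float]]) -> List[float]:
    """Combine per-emotion embeddings by a weighted sum (an assumed scheme).

    embeddings[i] is a stand-in style embedding for EMOTIONS[i].
    """
    dim = len(embeddings[0])
    return [sum(a * e[d] for a, e in zip(attr, embeddings)) for d in range(dim)]


# Usage: request an equal blend of "happy" and "sad".
attr = attribute_vector({"happy": 0.5, "sad": 0.5})
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
mixed = mix_embeddings(attr, toy_embeddings)
```

In a real system the mixed representation would condition a text-to-speech decoder; the toy two-dimensional embeddings above only stand in for learned ones.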
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Speech and Audio Processing
