iEmoTTS: Toward Robust Cross-Speaker Emotion Transfer and Control for Speech Synthesis based on Disentanglement between Prosody and Timbre
Guangyan Zhang, Ying Qin, Wenjie Zhang, Jialun Wu, Mei Li, Yutao Gai, Feijun Jiang, Tan Lee

TL;DR
iEmoTTS is a novel speech synthesis system that enables robust cross-speaker emotion transfer and control by disentangling prosody and timbre, allowing zero-shot emotion transfer and controllable emotional speech generation.
Contribution
The paper introduces iEmoTTS, a system that disentangles prosody and timbre for improved cross-speaker emotion transfer in speech synthesis, including zero-shot capabilities.
Findings
Effective emotion transfer demonstrated through subjective evaluation.
Produces speech with a specified emotion type and controllable intensity.
Outperforms recent systems in cross-speaker emotion transfer tasks.
Abstract
The capability of generating speech with a specific type of emotion is desired for many applications of human-computer interaction. Cross-speaker emotion transfer is a common approach to generating emotional speech when speech with emotion labels from target speakers is not available for model training. This paper presents a novel cross-speaker emotion transfer system, named iEmoTTS. The system is composed of an emotion encoder, a prosody predictor, and a timbre encoder. The emotion encoder extracts the identity of the emotion type as well as the respective emotion intensity from the mel-spectrogram of input speech. The emotion intensity is measured by the posterior probability that the input utterance carries that emotion. The prosody predictor is used to provide prosodic features for emotion transfer. The timbre encoder provides timbre-related information for the system. Unlike many other…
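The abstract's definition of emotion intensity as a posterior probability can be illustrated with a minimal sketch. It assumes the emotion encoder ends in a classifier that emits one logit per emotion class and that the posterior is obtained with a softmax; neither detail is confirmed by the abstract. The label set `EMOTIONS` and the function names are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical label set; the abstract does not list the emotion classes.
EMOTIONS = ["neutral", "happy", "sad", "angry"]


def softmax(logits):
    """Convert raw classifier scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def emotion_and_intensity(logits):
    """Return (emotion label, intensity) for one utterance.

    Assumption: intensity is the posterior probability of the most
    likely emotion class, following the abstract's description.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs[best]


if __name__ == "__main__":
    # Made-up logits standing in for an emotion classifier's output.
    label, intensity = emotion_and_intensity([0.1, 2.3, -0.5, 0.4])
    print(label, round(intensity, 3))
```

Under this reading, an utterance that the classifier assigns confidently to one class receives an intensity near 1, while an ambiguous utterance receives a lower value.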
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
