Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis
Tao Li, Xinsheng Wang, Qicong Xie, Zhichao Wang, Mingqi Jiang, Lei Xie

TL;DR
This paper introduces a prosody compensation module that enhances emotional information transfer in cross-speaker speech synthesis, ensuring emotional fidelity while preserving target speaker timbre.
Contribution
The paper proposes a novel prosody compensation module utilizing global context blocks to improve emotion transfer in end-to-end speech synthesis.
Findings
The prosody compensation module (PCM) effectively compensates for the loss of emotional information.
The method preserves the target speaker's timbre.
The method outperforms state-of-the-art models in emotion transfer.
Abstract
Cross-speaker emotion transfer speech synthesis aims to synthesize emotional speech for a target speaker by transferring the emotion from reference speech recorded by another (source) speaker. In this task, extracting a speaker-independent emotion embedding from reference speech plays an important role. However, the emotional information conveyed by such an emotion embedding tends to be weakened in the process of squeezing out the source speaker's timbre information. In response to this problem, a prosody compensation module (PCM) is proposed in this paper to compensate for the emotional information loss. Specifically, the PCM tries to obtain speaker-independent emotional information from the intermediate features of a pre-trained ASR model. To this end, a prosody compensation encoder with global context (GC) blocks is introduced to obtain global emotional information from the ASR model's…
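The abstract is cut off before it describes the GC blocks in detail, so the following is only a rough sketch of what a generic global context block might look like when applied to frame-level features. It assumes the commonly used three-step design (attention pooling over time, a bottleneck transform, and a broadcast addition), which may differ from the paper's actual encoder. All names, shapes, and weights here (`global_context_block`, `w_k`, `w_1`, `w_2`, the reduction ratio `r`) are hypothetical and chosen for illustration.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize a 1-D vector to zero mean and unit variance.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def global_context_block(feats, w_k, w_1, w_2):
    """Illustrative GC block. feats: (T, C) frame-level features -> (T, C)."""
    # 1) Context modeling: one attention weight per frame, pooled over time.
    attn = softmax(feats @ w_k, axis=0)            # (T, 1), sums to 1 over T
    context = (attn * feats).sum(axis=0)           # (C,) utterance-level summary
    # 2) Transform: bottleneck projection with layer norm and ReLU.
    hidden = np.maximum(layer_norm(context @ w_1), 0.0)   # (C // r,)
    delta = hidden @ w_2                           # (C,)
    # 3) Fusion: add the same global vector to every frame.
    return feats + delta

# Hypothetical stand-in for intermediate ASR features: 50 frames, 16 channels.
rng = np.random.default_rng(0)
T, C, r = 50, 16, 4
feats = rng.standard_normal((T, C))
w_k = rng.standard_normal((C, 1)) * 0.1
w_1 = rng.standard_normal((C, C // r)) * 0.1
w_2 = rng.standard_normal((C // r, C)) * 0.1
out = global_context_block(feats, w_k, w_1, w_2)
print(out.shape)
```

In this sketch every frame receives the same additive offset, which is one way a block of this kind can inject utterance-level (global) information into frame-level features.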
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
