Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis

Tao Li; Xinsheng Wang; Qicong Xie; Zhichao Wang; Mingqi Jiang; Lei Xie

arXiv:2207.01198·cs.SD·July 5, 2022

TL;DR

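This paper introduces a prosody compensation module (PCM) for cross-speaker emotion transfer in speech synthesis. The PCM compensates for the emotional information that is lost when a speaker-independent emotion embedding is extracted from the reference speech, while the synthesized speech keeps the target speaker's timbre.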

Contribution

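The paper proposes a prosody compensation module (PCM) whose encoder uses global context (GC) blocks to obtain speaker-independent emotional information from the intermediate features of a pre-trained ASR model, improving emotion transfer in end-to-end speech synthesis.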

Findings

01 The PCM effectively compensates for the loss of emotional information.

02 The method maintains the target speaker's timbre.

03 The method outperforms state-of-the-art models in emotion transfer.

Abstract

Cross-speaker emotion transfer speech synthesis aims to synthesize emotional speech for a target speaker by transferring the emotion from reference speech recorded by another (source) speaker. In this task, extracting a speaker-independent emotion embedding from the reference speech plays an important role. However, the emotional information conveyed by such an emotion embedding tends to be weakened in the process of squeezing out the source speaker's timbre information. In response to this problem, a prosody compensation module (PCM) is proposed in this paper to compensate for the emotional information loss. Specifically, the PCM tries to obtain speaker-independent emotional information from the intermediate feature of a pre-trained ASR model. To this end, a prosody compensation encoder with global context (GC) blocks is introduced to obtain global emotional information from the ASR model's…
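
The abstract does not say how the GC blocks are configured, so the following is only a minimal sketch of the general global context block idea (attention pooling over time, a bottleneck transform, and broadcast addition), not the authors' implementation. The function name `gc_block`, the weight names `w_key`, `w_down`, `w_up`, and the omission of layer normalization are all assumptions made for illustration. The input `frames` stands in for a sequence of intermediate ASR features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def gc_block(frames, w_key, w_down, w_up):
    """Hypothetical global context block over a T x D feature sequence.

    frames: list of T frames, each a list of D floats
    w_key:  D weights that score each frame for attention pooling
    w_down: H x D bottleneck projection
    w_up:   D x H projection back to the feature size
    """
    # 1) Context modeling: one attention weight per frame, then a
    #    weighted sum over time gives a single global vector.
    scores = [sum(k * x for k, x in zip(w_key, frame)) for frame in frames]
    alpha = softmax(scores)
    dim = len(frames[0])
    context = [
        sum(a * frame[d] for a, frame in zip(alpha, frames))
        for d in range(dim)
    ]

    # 2) Transform: bottleneck with ReLU (layer norm left out for brevity).
    hidden = [max(0.0, h) for h in matvec(w_down, context)]
    delta = matvec(w_up, hidden)

    # 3) Fusion: add the same global vector to every frame.
    return [[x + d for x, d in zip(frame, delta)] for frame in frames]
```

Because the same pooled vector is added to every frame, each output frame carries utterance-level information, which is the sense in which such a block captures "global" context.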

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing