Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis

Songxiang Liu; Shan Yang; Dan Su; Dong Yu

arXiv:2109.03439·eess.AS·September 9, 2021

Open Access

TL;DR

Referee introduces a reference-free, robust cross-speaker style transfer method for expressive speech synthesis that leverages low-quality data and combines text-to-style and style-to-wave models for high-fidelity output.

Contribution

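The paper proposes a cascade of a text-to-style (T2S) model, trained with a novel pretrain-refinement method, and a style-to-wave (S2W) model. This enables cross-speaker style transfer without a reference utterance at synthesis time and without expensive high-quality expressive training data.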

Findings

1. Referee outperforms a GST-based baseline in style transfer quality.

2. Referee learns speaking styles effectively from low-quality data.

3. The proposed method achieves high-fidelity speech synthesis.

Abstract

Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis aims at transferring a speaking style to the synthesised speech in a target speaker's voice. Most previous CSST approaches rely on expensive high-quality data carrying desired speaking style during training and require a reference utterance to obtain speaking style descriptors as conditioning on the generation of a new sentence. This work presents Referee, a robust reference-free CSST approach for expressive TTS, which fully leverages low-quality data to learn speaking styles from text. Referee is built by cascading a text-to-style (T2S) model with a style-to-wave (S2W) model. Phonetic PosteriorGram (PPG), phoneme-level pitch and energy contours are adopted as fine-grained speaking style descriptors, which are predicted from text using the T2S model. A novel pretrain-refinement method is adopted to learn a robust T2S…
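The cascade described in the abstract can be pictured as a two-stage data flow: text goes into a T2S model that predicts the style descriptors (PPG, phoneme-level pitch and energy), and those descriptors go into an S2W model that produces a waveform for the target speaker. The sketch below only illustrates that flow and is not the authors' implementation. The names `StyleDescriptors`, `ToyT2S`, `ToyS2W` and `synthesize_reference_free` are hypothetical, both "models" are random or sinusoidal placeholders rather than neural networks, and the fixed number of frames per phoneme is an assumption made to keep the example small.

```python
import math
import random
from dataclasses import dataclass
from typing import List


@dataclass
class StyleDescriptors:
    """Fine-grained style descriptors named in the abstract."""
    ppg: List[List[float]]  # one phonetic posterior vector per frame
    pitch: List[float]      # one value per phoneme (Hz, assumed unit)
    energy: List[float]     # one value per phoneme, in [0, 1) here


class ToyT2S:
    """Hypothetical stand-in for a text-to-style model (not the paper's model)."""

    def __init__(self, num_classes: int = 4, frames_per_phoneme: int = 3, seed: int = 0):
        self.num_classes = num_classes
        self.frames_per_phoneme = frames_per_phoneme  # assumption: fixed duration
        self.rng = random.Random(seed)

    def predict(self, phonemes: List[str]) -> StyleDescriptors:
        ppg = []
        for _ in range(len(phonemes) * self.frames_per_phoneme):
            raw = [self.rng.random() + 1e-6 for _ in range(self.num_classes)]
            total = sum(raw)
            ppg.append([v / total for v in raw])  # normalise to a posterior
        pitch = [self.rng.uniform(80.0, 300.0) for _ in phonemes]
        energy = [self.rng.random() for _ in phonemes]
        return StyleDescriptors(ppg=ppg, pitch=pitch, energy=energy)


class ToyS2W:
    """Hypothetical stand-in for a style-to-wave model (not the paper's model)."""

    def __init__(self, sample_rate: int = 16000, samples_per_frame: int = 80):
        self.sample_rate = sample_rate
        self.samples_per_frame = samples_per_frame

    def synthesize(self, style: StyleDescriptors, speaker_id: int) -> List[float]:
        # A real S2W model would condition on the target speaker; this toy
        # ignores speaker_id and renders a sine wave from pitch and energy only.
        if not style.pitch:
            return []
        frames_per_phoneme = len(style.ppg) // len(style.pitch)
        wave = []
        for frame in range(len(style.ppg)):
            p = frame // frames_per_phoneme  # phoneme that owns this frame
            for n in range(self.samples_per_frame):
                t = (frame * self.samples_per_frame + n) / self.sample_rate
                wave.append(style.energy[p] * math.sin(2 * math.pi * style.pitch[p] * t))
        return wave


def synthesize_reference_free(phonemes: List[str], speaker_id: int,
                              t2s: ToyT2S, s2w: ToyS2W) -> List[float]:
    """Cascade T2S -> S2W; note that no reference utterance is passed in."""
    style = t2s.predict(phonemes)
    return s2w.synthesize(style, speaker_id)


if __name__ == "__main__":
    wave = synthesize_reference_free(["HH", "AH", "L"], speaker_id=0,
                                     t2s=ToyT2S(), s2w=ToyS2W())
    print(len(wave))
```

The point of the sketch is the interface: the only inputs at synthesis time are the text (here a phoneme list) and a target speaker identity, which is what makes the approach reference-free.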

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing