GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Rongjie Huang; Yi Ren; Jinglin Liu; Chenye Cui; Zhou Zhao

arXiv:2205.07211 · eess.AS · October 14, 2022 · 21 cites

TL;DR

GenerSpeech is a text-to-speech (TTS) model for high-fidelity zero-shot style transfer to out-of-domain (OOD) custom voices. It models speaker, emotion, and prosody styles and is reported to be more robust than existing methods.

Contribution

The paper presents a new TTS framework with multi-level style adaptation and a generalizable content adaptor, enabling superior zero-shot style transfer for out-of-domain speech.

Findings

01 Outperforms state-of-the-art models in audio quality and style similarity.

02 Demonstrates robustness in few-shot adaptive style transfer.

03 Effectively models diverse style conditions including speaker, emotion, and prosody.

Abstract

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) fine-grained prosodic…
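The abstract does not say how style conditions are injected into the model. One common mechanism in style-adaptive TTS, and one consistent with the "Layer Normalization" method tag listed for this paper, is a layer normalization whose scale and bias are predicted from a style vector. The sketch below illustrates that general idea in plain Python. It is an assumption for illustration, not the authors' implementation: the function names, the linear projections `w_scale` and `w_bias`, and the vector sizes are all hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def style_conditioned_layer_norm(x, style, w_scale, w_bias):
    """Hypothetical style-conditioned layer norm (illustrative only).

    x       : hidden features for one frame, length D
    style   : style embedding (e.g. speaker or emotion), length S
    w_scale : D x S matrix projecting the style vector to a per-feature scale offset
    w_bias  : D x S matrix projecting the style vector to a per-feature bias
    """
    normed = layer_norm(x)
    # Scale is 1 plus a style-dependent offset, so a zero style vector
    # reduces this to ordinary (style-agnostic) layer normalization.
    scale = [1.0 + sum(w * s for w, s in zip(row, style)) for row in w_scale]
    bias = [sum(w * s for w, s in zip(row, style)) for row in w_bias]
    return [g * n + b for g, n, b in zip(scale, normed, bias)]
```

With this formulation the normalized features carry the style-agnostic content, while the predicted scale and bias carry the style-specific variation, which loosely mirrors the decomposition the abstract describes.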

Code & Models

Videos

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech · slideslive

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Layer Normalization