GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech
Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao

TL;DR
GenerSpeech is a text-to-speech (TTS) model for high-fidelity zero-shot style transfer to out-of-domain (OOD) speech. It handles diverse styles (speaker identity, emotion, and prosody) and is reported to be more robust than existing methods.
Contribution
The paper presents a TTS framework that combines a multi-level style adaptor with a generalizable content adaptor to perform zero-shot style transfer for out-of-domain speech.
Findings
Outperforms state-of-the-art models in audio quality and style similarity.
Demonstrates robustness in few-shot adaptive style transfer.
Effectively models diverse style conditions including speaker, emotion, and prosody.
Abstract
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) fine-grained prosodic…
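The split into a style-agnostic part and a style-specific part, with style applied at a global and a local level, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`layer_norm`, `add_style`), the use of plain normalization to stand in for the style-agnostic part, and the additive conditioning are all assumptions made for illustration only.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one hidden vector to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def add_style(content, global_style, local_styles):
    """Condition per-phoneme content vectors on two levels of style.

    content:      list of T hidden vectors (one per phoneme), each of size D
    global_style: one vector of size D (e.g. speaker or emotion), shared by all steps
    local_styles: list of T vectors of size D (fine-grained prosody per phoneme)
    """
    if len(content) != len(local_styles):
        raise ValueError("need one local style vector per content step")
    styled = []
    for hidden, local in zip(content, local_styles):
        # Hypothetical stand-in for the style-agnostic part: plain normalization.
        hidden = layer_norm(hidden)
        # Style-specific part: the global vector is shared, the local one varies per step.
        styled.append([h + g + l for h, g, l in zip(hidden, global_style, local)])
    return styled


# Toy example: two phonemes, hidden size 3.
content = [[1.0, 2.0, 3.0], [4.0, 4.0, 10.0]]
global_style = [0.5, 0.5, 0.5]
local_styles = [[0.0, 0.0, 0.0], [0.1, -0.1, 0.0]]
styled = add_style(content, global_style, local_styles)
```

In this sketch the global vector is added to every step, while each step gets its own local vector, which mirrors the global versus fine-grained distinction drawn in the abstract. A real model would learn these vectors from the acoustic reference rather than take them as fixed inputs.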
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Layer Normalization
