AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
Yihan Wu, Xu Tan, Bohan Li, Lei He, Sheng Zhao, Ruihua Song, Tao Qin, Tie-Yan Liu

TL;DR
AdaSpeech 4 is a zero-shot adaptive TTS system that improves generalization to unseen speakers and synthesis quality through systematic speaker modeling, enabling high-quality voice synthesis without fine-tuning.
Contribution
The paper introduces a novel speaker modeling approach using basis vectors, conditional layer normalization, and a distribution-based supervision loss for improved zero-shot TTS.
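As a rough illustration of the conditional layer normalization idea named above, the sketch below normalizes a hidden vector and then applies a scale and bias predicted from a speaker representation instead of fixed learned parameters. The function names, the use of plain linear projections, and the parameter shapes are assumptions made for illustration; this summary does not specify the paper's exact formulation.

```python
import math
from typing import List, Sequence


def linear(x: Sequence[float],
           weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Affine map y = W x + b, with weight of shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def conditional_layer_norm(hidden: Sequence[float],
                           speaker: Sequence[float],
                           w_scale: Sequence[Sequence[float]],
                           b_scale: Sequence[float],
                           w_bias: Sequence[Sequence[float]],
                           b_bias: Sequence[float],
                           eps: float = 1e-5) -> List[float]:
    """Layer norm whose scale and bias depend on a speaker vector.

    Assumption: scale and bias are each a linear projection of the
    speaker representation (hypothetical parameterization).
    """
    # Standard layer normalization over the feature dimension.
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]

    # Speaker-conditioned affine parameters, one per feature.
    scale = linear(speaker, w_scale, b_scale)
    bias = linear(speaker, w_bias, b_bias)

    return [s * n + b for s, n, b in zip(scale, normed, bias)]
```

With zero projection weights, a scale offset of ones, and a bias offset of zeros, the function reduces to ordinary layer normalization; nonzero projection weights let the same layer behave differently for each speaker without any fine-tuning of the model.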
Findings
Achieves better voice quality and similarity than baselines
Operates effectively without fine-tuning on new speakers
Demonstrates strong generalization across multiple datasets
Abstract
Adaptive text to speech (TTS) can synthesize new voices in zero-shot scenarios efficiently, by using a well-trained source TTS model without adapting it on the speech data of new speakers. Considering seen and unseen speakers have diverse characteristics, zero-shot adaptive TTS requires strong generalization ability on speaker characteristics, which brings modeling challenges. In this paper, we develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis. We model the speaker characteristics systematically to improve the generalization on new speakers. Generally, the modeling of speaker characteristics can be categorized into three steps: extracting speaker representation, taking this speaker representation as condition, and synthesizing speech/mel-spectrogram given this speaker representation. Accordingly, we improve the modeling in three steps: 1) To extract…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
Methods: Layer Normalization
