AdaSpeech: Adaptive Text to Speech for Custom Voice
Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, Tie-Yan Liu

TL;DR
AdaSpeech is an adaptive TTS system that efficiently customizes high-quality voices for individual speakers using minimal data by employing novel acoustic encoding and conditional normalization techniques.
Contribution
The paper introduces AdaSpeech, a new adaptive TTS framework that effectively handles diverse acoustic conditions and reduces adaptation parameters for personalized voice synthesis.
Findings
Achieves superior voice adaptation quality over baselines
Uses only about 5K parameters per speaker for customization
Effective with as few as 20 sentences of speech data
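The small per-speaker footprint is easier to picture with a sketch of layer normalization conditioned on a speaker. The following is a minimal illustration, not the authors' implementation. It assumes the scale and bias of each layer normalization are produced from a speaker embedding by linear maps, and that only the resulting vectors plus the embedding need to be stored per speaker. All names (`conditional_layer_norm`, `per_speaker_params`) and all sizes (hidden size 256, 4 layers, 2 normalizations per layer) are hypothetical and chosen only for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Plain layer normalization of one hidden vector (no learned scale/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weight, vec):
    """Matrix-vector product; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def conditional_layer_norm(x, speaker_emb, w_scale, w_bias):
    """Assumed form: scale and bias are derived from the speaker embedding."""
    scale = linear(w_scale, speaker_emb)
    bias = linear(w_bias, speaker_emb)
    return [s * n + b for n, s, b in zip(layer_norm(x), scale, bias)]

def per_speaker_params(hidden, n_layers, norms_per_layer, spk_dim):
    """Count stored values: one scale and one bias vector per norm, plus the embedding."""
    return n_layers * norms_per_layer * 2 * hidden + spk_dim

# Toy example with a 4-dimensional hidden vector and a 2-dimensional speaker embedding.
x = [1.0, 2.0, 3.0, 4.0]
speaker_emb = [1.0, 0.0]
w_scale = [[1.0, 0.0] for _ in range(4)]  # yields scale = [1, 1, 1, 1] for this speaker
w_bias = [[0.0, 0.5] for _ in range(4)]   # yields bias = [0, 0, 0, 0] for this speaker
y = conditional_layer_norm(x, speaker_emb, w_scale, w_bias)

# Hypothetical model sizes, not taken from the text above.
n_params = per_speaker_params(hidden=256, n_layers=4, norms_per_layer=2, spk_dim=256)
```

Under these assumed sizes the per-speaker count is a few thousand values, the same order of magnitude as the roughly 5K reported above. The exact figure depends on the real architecture.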
Abstract
Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize personal voice for a target speaker using few speech data. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that could be very different from source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we use two acoustic encoders to extract an…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Layer Normalization
