USAT: A Universal Speaker-Adaptive Text-to-Speech Approach
Wenbin Wang, Yang Song, Sanjay Jha

TL;DR
This paper introduces USAT, a unified TTS framework that combines zero-shot and few-shot speaker adaptation to synthesize lifelike speech for both native and non-native speakers with limited data, addressing generalization and storage challenges.
Contribution
USAT is the first framework to unify zero-shot and few-shot speaker adaptation in TTS, with novel discriminators, memory mechanisms, and adapters to improve generalization and prevent forgetting.
Findings
Enhanced voice synthesis for unseen speakers with limited data.
Reduced storage and overfitting through novel adapters and procedures.
Improved performance on diverse accents and speaker types.
Abstract
Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot or few-shot speaker-adaptive TTS approaches have been explored, they have many limitations. Zero-shot approaches tend to suffer from insufficient generalization performance to reproduce the voice of speakers with heavy accents. While few-shot methods can reproduce highly varying accents, they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches only provide either zero-shot or few-shot adaptation, constraining their utility across varied real-world scenarios with different…
