USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Wenbin Wang; Yang Song; Sanjay Jha

arXiv:2404.18094 · cs.SD · April 30, 2024

TL;DR

This paper introduces USAT, a unified TTS framework that combines zero-shot and few-shot speaker adaptation to synthesize lifelike speech for both native and non-native speakers with limited data, addressing generalization and storage challenges.

Contribution

USAT is the first framework to unify zero-shot and few-shot speaker adaptation in TTS, with novel discriminators, memory mechanisms, and adapters to improve generalization and prevent forgetting.

Findings

01 Enhanced voice synthesis for unseen speakers with limited data.

02 Reduced storage and overfitting through novel adapters and procedures.

03 Improved performance on diverse accents and speaker types.

Abstract

Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot or few-shot speaker-adaptive TTS approaches have been explored, they have many limitations. Zero-shot approaches tend to suffer from insufficient generalization performance to reproduce the voice of speakers with heavy accents. While few-shot methods can reproduce highly varying accents, they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches only provide either zero-shot or few-shot adaptation, constraining their utility across varied real-world scenarios with different…

Code & Models

Repositories

mushanshanshan/esltts (Official)

Datasets

MushanW/ESLTTS
dataset · 25 downloads
