ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis
Youngwon Choi, Jinwoo Oh, Hwayeon Kim, Hyeonyu Kim

TL;DR
ZeSTA is a domain-conditioned training method that lets zero-shot TTS output serve as augmentation data for personalized speech synthesis, improving speaker similarity over naive augmentation when target recordings are scarce, without modifying the base architecture.
Contribution
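The paper presents ZeSTA, a domain-conditioned training framework for low-resource personalized TTS. It distinguishes real from synthetic speech with a lightweight domain embedding and oversamples the real data to stabilize adaptation, while preserving intelligibility and perceptual quality.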
Findings
Improves speaker similarity over naive augmentation
Maintains intelligibility and perceptual quality
Effective on LibriTTS and an in-house dataset, with two ZS-TTS sources
Abstract
We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual…
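The two ingredients named in the abstract, a domain label for each utterance and oversampling of the real recordings, can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `build_epoch`, `condition`, and `real_fraction`, the target ratio, and the choice to add the domain embedding element-wise to a speaker conditioning vector are all hypothetical, since the abstract does not say how the embedding is injected or how strongly real data is oversampled.

```python
import math
import random

# Hypothetical domain ids; the abstract only says real and synthetic
# speech are distinguished by a lightweight domain embedding.
REAL, SYNTHETIC = 0, 1


def build_epoch(real_utts, synth_utts, real_fraction=0.5, seed=0):
    """Return a shuffled list of (utterance, domain_id) pairs.

    Real utterances are repeated so that they make up roughly
    `real_fraction` of the epoch. The ratio is an illustrative knob,
    not a value taken from the paper.
    """
    if not real_utts:
        raise ValueError("need at least one real utterance")
    if not 0.0 < real_fraction < 1.0:
        raise ValueError("real_fraction must be in (0, 1)")

    # Solve n_real / (n_real + n_synth) = real_fraction for n_real.
    n_real = math.ceil(real_fraction * len(synth_utts) / (1.0 - real_fraction))
    # Never drop real recordings, even if there is little synthetic data.
    n_real = max(n_real, len(real_utts))

    # Cycle through the real set until n_real items are collected.
    repeats = math.ceil(n_real / len(real_utts))
    real_items = (list(real_utts) * repeats)[:n_real]

    epoch = [(u, REAL) for u in real_items]
    epoch += [(u, SYNTHETIC) for u in synth_utts]
    random.Random(seed).shuffle(epoch)
    return epoch


def condition(speaker_vec, domain_id, domain_table):
    """Add the domain embedding to a conditioning vector, element-wise.

    Assumption: additive conditioning. In a real model the table would
    be a learned embedding layer rather than fixed lists.
    """
    domain_vec = domain_table[domain_id]
    if len(domain_vec) != len(speaker_vec):
        raise ValueError("embedding sizes must match")
    return [s + d for s, d in zip(speaker_vec, domain_vec)]


if __name__ == "__main__":
    real = ["real_001", "real_002"]
    synth = [f"synth_{i:03d}" for i in range(10)]
    epoch = build_epoch(real, synth, real_fraction=0.5)
    n_real = sum(1 for _, dom in epoch if dom == REAL)
    print(f"{n_real} real / {len(epoch)} total items in the epoch")
```

In a deep-learning framework the same idea would typically map onto a learned embedding table with two rows and a weighted sampler. The base TTS architecture stays untouched either way, which matches the abstract's claim that no architectural modification is needed.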
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
