Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data
Sungwon Kim, Heeseung Kim, Sungroh Yoon

TL;DR
Guided-TTS 2 introduces a diffusion-based model capable of high-quality, adaptive text-to-speech synthesis using only untranscribed data, enabling rapid speaker adaptation and zero-shot multi-speaker performance.
Contribution
Guided-TTS 2 combines a speaker-conditional diffusion model with a speaker-dependent phoneme classifier and fine-tunes the diffusion model efficiently on untranscribed reference speech, advancing both zero-shot and rapid speaker adaptation.
Findings
Achieves comparable quality to single-speaker TTS with only 10 seconds of untranscribed data.
Outperforms existing adaptive TTS methods on multi-speaker datasets.
Enables voice adaptation for non-human characters using untranscribed speech.
Abstract
We propose Guided-TTS 2, a diffusion-based generative model for high-quality adaptive TTS using untranscribed data. Guided-TTS 2 combines a speaker-conditional diffusion model with a speaker-dependent phoneme classifier for adaptive text-to-speech. We train the speaker-conditional diffusion model on large-scale untranscribed datasets for a classifier-free guidance method and further fine-tune the diffusion model on the reference speech of the target speaker for adaptation, which takes only 40 seconds. We demonstrate that Guided-TTS 2 shows comparable performance to high-quality single-speaker TTS baselines in terms of speech quality and speaker similarity with only ten seconds of untranscribed data. We further show that Guided-TTS 2 outperforms adaptive TTS baselines on multi-speaker datasets even with a zero-shot adaptation setting. Guided-TTS 2 can adapt to a wide range of voices only…
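The abstract mentions two guidance signals: classifier-free guidance on the speaker condition and guidance from a phoneme classifier. The sketch below illustrates one common way such signals are combined at a single sampling step. It is a minimal illustration, not the authors' implementation: the function names, the scale parameters, and the specific parameterization (extrapolating from the unconditional score toward the conditional one) are assumptions, and the exact formulation used in Guided-TTS 2 is not given in this summary.

```python
from typing import List

Vector = List[float]


def classifier_free_score(cond: Vector, uncond: Vector, scale: float) -> Vector:
    """Blend conditional and unconditional score estimates.

    Assumed parameterization: uncond + scale * (cond - uncond).
    scale = 0 gives the unconditional score, scale = 1 the conditional one,
    and scale > 1 pushes further toward the condition (here, the speaker).
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def guided_score(
    cond: Vector,
    uncond: Vector,
    phoneme_grad: Vector,
    speaker_scale: float,
    text_scale: float,
) -> Vector:
    """Add a classifier gradient on top of the classifier-free score.

    phoneme_grad stands in for the gradient of a phoneme classifier's
    log-probability with respect to the noisy input (hypothetical input here;
    in practice it would come from backpropagating through the classifier).
    """
    cfg = classifier_free_score(cond, uncond, speaker_scale)
    return [s + text_scale * g for s, g in zip(cfg, phoneme_grad)]


if __name__ == "__main__":
    # Toy two-dimensional example with made-up numbers.
    print(guided_score([1.0, 2.0], [0.0, 0.0], [0.5, -0.5], 1.0, 2.0))
```

In this sketch the speaker scale controls how strongly the sample follows the target voice, while the text scale controls how strongly it follows the phoneme sequence; keeping the two scales separate lets each be tuned independently.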
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Diffusion
