Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data

Sungwon Kim; Heeseung Kim; Sungroh Yoon

arXiv:2205.15370 · cs.SD · June 1, 2022 · 21 cites

Open Access

TL;DR

Guided-TTS 2 introduces a diffusion-based model capable of high-quality, adaptive text-to-speech synthesis using only untranscribed data, enabling rapid speaker adaptation and zero-shot multi-speaker performance.

Contribution

Guided-TTS 2 combines a speaker-conditional diffusion model with a phoneme classifier and fine-tunes the diffusion model on untranscribed reference speech of the target speaker, supporting both zero-shot and rapid speaker adaptation.

Findings

01. Achieves quality comparable to single-speaker TTS baselines with only 10 seconds of untranscribed data.

02. Outperforms existing adaptive TTS methods on multi-speaker datasets.

03. Enables voice adaptation for non-human characters using untranscribed speech.

Abstract

We propose Guided-TTS 2, a diffusion-based generative model for high-quality adaptive TTS using untranscribed data. Guided-TTS 2 combines a speaker-conditional diffusion model with a speaker-dependent phoneme classifier for adaptive text-to-speech. We train the speaker-conditional diffusion model on large-scale untranscribed datasets for a classifier-free guidance method and further fine-tune the diffusion model on the reference speech of the target speaker for adaptation, which takes only 40 seconds. We demonstrate that Guided-TTS 2 shows comparable performance to high-quality single-speaker TTS baselines in terms of speech quality and speaker similarity with only ten seconds of untranscribed data. We further show that Guided-TTS 2 outperforms adaptive TTS baselines on multi-speaker datasets even in a zero-shot adaptation setting. Guided-TTS 2 can adapt to a wide range of voices only…
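The abstract names two guidance signals: classifier-free guidance on the speaker condition and a phoneme classifier for the text. The sketch below shows one generic way such signals are commonly combined at a single sampling step. It is an illustration, not the paper's exact update rule: the function name `guided_score`, the two scale parameters, and the toy vectors are all assumptions introduced here.

```python
from typing import List

Vector = List[float]


def guided_score(
    uncond: Vector,
    cond: Vector,
    classifier_grad: Vector,
    speaker_scale: float = 1.0,
    text_scale: float = 1.0,
) -> Vector:
    """Combine guidance signals element-wise (hypothetical helper).

    uncond          -- score predicted without the speaker condition
    cond            -- score predicted with the speaker condition
    classifier_grad -- gradient of the phoneme classifier's log-probability
    speaker_scale   -- assumed weight for classifier-free (speaker) guidance
    text_scale      -- assumed weight for classifier (text) guidance
    """
    if not (len(uncond) == len(cond) == len(classifier_grad)):
        raise ValueError("all inputs must have the same length")
    return [
        # Classifier-free term pushes from the unconditional score toward
        # the speaker-conditional one; the classifier term adds text guidance.
        u + speaker_scale * (c - u) + text_scale * g
        for u, c, g in zip(uncond, cond, classifier_grad)
    ]


# Toy two-dimensional values, chosen only to make the arithmetic easy to follow.
example = guided_score(
    [0.0, 1.0], [1.0, 3.0], [0.5, -0.5], speaker_scale=2.0, text_scale=1.0
)
```

With `speaker_scale=1.0` and `text_scale=0.0` the function returns the speaker-conditional score unchanged; larger scales trade sample diversity for stronger adherence to the speaker or the text.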

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Diffusion