UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data

Heeseung Kim; Sungwon Kim; Jiheum Yeom; Sungroh Yoon

arXiv:2306.16083·cs.SD·June 29, 2023

TL;DR

UnitSpeech introduces a novel speaker-adaptive speech synthesis method that fine-tunes a diffusion-based TTS model with minimal untranscribed data using self-supervised unit representations, enabling personalized TTS and voice conversion.

Contribution

It is the first to integrate self-supervised unit representations into a diffusion-based TTS model for effective speaker adaptation with untranscribed data.

Findings

01

Achieves results comparable or superior to previous baselines on personalized TTS and any-to-any voice conversion.

02

Operates effectively with minimal untranscribed data.

03

Demonstrates broad adaptability to real-world data and various tasks.

Abstract

We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes a diffusion-based text-to-speech (TTS) model using minimal untranscribed data. To achieve this, we use the self-supervised unit representation as a pseudo transcript and integrate the unit encoder into the pre-trained TTS model. We train the unit encoder to provide speech content to the diffusion-based decoder and then fine-tune the decoder for speaker adaptation to the reference speaker using a single $\langle \text{unit}, \text{speech} \rangle$ pair. UnitSpeech performs speech synthesis tasks such as TTS and voice conversion (VC) in a personalized manner without requiring model re-training for each task. UnitSpeech achieves comparable and superior results on personalized TTS and any-to-any VC tasks compared to previous baselines. Our model also shows widespread adaptive performance on real-world data and other tasks that use a…
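
To make the idea of a unit sequence acting as a pseudo transcript more concrete, here is a minimal sketch. It assumes that a self-supervised unit extractor emits one discrete unit ID per audio frame and that consecutive repeats are merged into (unit, duration) pairs, a common preprocessing step in unit-based speech systems. The unit IDs, the function names `collapse_units` and `expand_units`, and the decision to collapse repeats are illustrative assumptions, not details stated in the abstract or taken from the UnitSpeech repository.

```python
from itertools import groupby


def collapse_units(frame_units):
    """Merge runs of identical frame-level unit IDs into (unit, duration) pairs.

    Hypothetical helper: the resulting unit sequence plays the role of a
    text-free "pseudo transcript", and the durations record how many frames
    each unit spans.
    """
    return [(unit, sum(1 for _ in run)) for unit, run in groupby(frame_units)]


def expand_units(pairs):
    """Inverse of collapse_units: repeat each unit for its duration in frames."""
    return [unit for unit, duration in pairs for _ in range(duration)]


# Made-up frame-level unit IDs standing in for the output of a unit extractor.
frame_units = [12, 12, 12, 7, 7, 40, 40, 40, 40, 12]

pseudo_transcript = collapse_units(frame_units)
print(pseudo_transcript)  # [(12, 3), (7, 2), (40, 4), (12, 1)]
```

Under these assumptions, the collapsed units would stand in for text as the content input, while the paired reference speech would supply the target for speaker adaptation of the decoder.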

Code & Models

Repositories

gmltmd789/UnitSpeech
PyTorch · Official

Datasets

purdueviperlab/diffssd
dataset · 37 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing