UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data
Heeseung Kim, Sungwon Kim, Jiheum Yeom, Sungroh Yoon

TL;DR
UnitSpeech introduces a novel speaker-adaptive speech synthesis method that fine-tunes a diffusion-based TTS model with minimal untranscribed data using self-supervised unit representations, enabling personalized TTS and voice conversion.
Contribution
It is the first to integrate self-supervised unit representations into a diffusion-based TTS model for effective speaker adaptation with untranscribed data.
Findings
Achieves comparable and superior results on personalized TTS and voice conversion.
Operates effectively with minimal untranscribed data.
Demonstrates broad adaptability to real-world data and various tasks.
Abstract
We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes a diffusion-based text-to-speech (TTS) model using minimal untranscribed data. To achieve this, we use the self-supervised unit representation as a pseudo transcript and integrate the unit encoder into the pre-trained TTS model. We train the unit encoder to provide speech content to the diffusion-based decoder and then fine-tune the decoder for speaker adaptation to the reference speaker using a single unit, speech pair. UnitSpeech performs speech synthesis tasks such as TTS and voice conversion (VC) in a personalized manner without requiring model re-training for each task. UnitSpeech achieves comparable and superior results on personalized TTS and any-to-any VC tasks compared to previous baselines. Our model also shows widespread adaptive performance on real-world data and other tasks that use a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
