Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data

Xulong Zhang; Jianzong Wang; Ning Cheng; Jing Xiao

arXiv:2210.13803·cs.SD·October 26, 2022

TL;DR

Adapitch is a multi-speaker TTS system that improves voice synthesis quality by leveraging untranscribed data through self-supervised modules and disentangling pitch, text, and speaker content.

Contribution

It introduces a novel adaptation framework for multi-speaker TTS that utilizes untranscribed data and content disentanglement for enhanced synthesis quality.

Findings

01

Achieves much better synthesis quality than the baseline methods in the reported experiments.

02

Effectively utilizes untranscribed data for TTS adaptation.

03

Disentangles pitch, text, and speaker for improved prosody control.

Abstract

In this paper, we propose Adapitch, a multi-speaker TTS method that adapts the supervised module using untranscribed data. We design two self-supervised modules that train the text encoder and the mel decoder separately on untranscribed data to enhance the representations of text and mel. To better handle the prosody information in a synthesized voice, we design a supervised TTS module conditioned on the disentangled content of pitch, text, and speaker. Training is split into two phases: first, the text encoder and mel decoder are pretrained in an unsupervised manner and then fixed; second, the disentangled TTS module is trained in a supervised manner. Experimental results show that Adapitch achieves much better quality than the baseline methods.
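The two-phase schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Module` class, the step counts, and the function names are hypothetical stand-ins, and "training" is reduced to counting updates so that the freeze-then-train ordering is visible without any deep learning library.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical stand-in for a sub-network (not from the paper)."""
    name: str
    trainable: bool = True
    updates: int = 0

    def step(self) -> None:
        # A frozen module ignores update requests.
        if self.trainable:
            self.updates += 1


def run_phase(modules, steps: int) -> None:
    """Apply `steps` update requests to every module in `modules`."""
    for _ in range(steps):
        for module in modules:
            module.step()


def train_two_phase(pretrain_steps: int = 3, supervised_steps: int = 2) -> dict:
    text_encoder = Module("text_encoder")
    mel_decoder = Module("mel_decoder")
    supervised_tts = Module("supervised_tts")

    # Phase 1: the two self-supervised modules are pretrained separately
    # (in the paper, on untranscribed data).
    run_phase([text_encoder], pretrain_steps)
    run_phase([mel_decoder], pretrain_steps)

    # The pretrained components are then fixed.
    text_encoder.trainable = False
    mel_decoder.trainable = False

    # Phase 2: supervised training; only the supervised TTS module changes,
    # even though the frozen components still take part in the pipeline.
    run_phase([text_encoder, mel_decoder, supervised_tts], supervised_steps)

    return {m.name: m.updates for m in (text_encoder, mel_decoder, supervised_tts)}
```

Calling `train_two_phase()` returns the number of updates each component received. The frozen encoder and decoder keep their phase-1 counts, which is the property the schedule relies on.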


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing