UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

Haogeng Liu; Tao Wang; Ruibo Fu; Jiangyan Yi; Zhengqi Wen; Jianhua Tao

arXiv:2301.03801·cs.SD·January 11, 2023·1 cites

TL;DR

UnifySpeech is a unified framework for text-to-speech and voice conversion. It decouples speech into content, speaker, and prosody components, which improves speaker modeling in TTS and content decoupling in VC.

Contribution

This work is the first to unify TTS and VC within a single model using speech component decoupling and domain bridging techniques.

Findings

01 TTS achieves improved speaker modeling.

02 VC demonstrates enhanced speech content decoupling.

03 The unified framework benefits both tasks.

Abstract

Text-to-speech (TTS) and voice conversion (VC) are two different tasks that both aim to generate high-quality speech from different input modalities. Because of their similarity, this paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time. The model is based on the assumption that speech can be decoupled into three independent components: content information, speaker information, and prosody information. Both TTS and VC can be regarded as extracting these three kinds of information from the input and then reconstructing the speech. In TTS the content information is derived from the text, while in VC it is derived from the source speech, so the two tasks share every module except the content extraction module. We apply vector quantization and a domain constraint to bridge the gap between the content…
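
As a rough illustration of why vector quantization can help bridge content representations from two input domains, the sketch below maps content frames from a text branch and a speech branch onto one shared codebook. This is generic nearest-neighbour quantization, not the paper's implementation: the abstract does not give the codebook size, feature dimension, or training losses, so the `codebook` values, the toy frames, and the function names here are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each content frame to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


# Hypothetical shared codebook with three 2-dimensional entries (assumption).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]

# Toy content frames, as if produced by a text encoder (TTS branch) and a
# speech content encoder (VC branch) for the same utterance (assumption).
text_content = [(0.1, -0.1), (0.9, 1.2)]
speech_content = [(-0.2, 0.1), (1.1, 0.8)]

text_codes = quantize(text_content, codebook)
speech_codes = quantize(speech_content, codebook)

# Slightly different continuous frames from the two branches collapse onto
# the same discrete codes, so a shared decoder would see the same content.
print(text_codes, speech_codes)
```

In this toy setting the two branches produce different continuous vectors but identical code indices, which is the intuition behind using a discrete bottleneck to make the content input of a shared decoder independent of where the content came from.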

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing