Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech

Byoung Jin Choi; Myeonghun Jeong; Minchan Kim; Sung Hwan Mun; Nam Soo Kim

arXiv:2210.05979·eess.AS·November 23, 2022

Open Access

TL;DR

This paper introduces adversarial speaker-consistency learning (ASCL), a novel approach that improves zero-shot multi-speaker TTS by using untranscribed speech data and adversarial training to enhance speaker similarity and quality.

Contribution

The paper proposes a new adversarial learning framework that leverages untranscribed speech data to address speaker domain shift in zero-shot multi-speaker TTS.

Findings

01 Improved speaker similarity in zero-shot TTS

02 Enhanced speech quality over baseline models

03 Effective use of untranscribed data in training

Abstract

Several recently proposed text-to-speech (TTS) models can generate speech samples of human-level quality in single-speaker and multi-speaker scenarios with a set of pre-defined speakers. However, synthesizing a new speaker's voice from a single reference audio, commonly known as zero-shot multi-speaker text-to-speech (ZSM-TTS), remains a very challenging task. The main challenge of ZSM-TTS is the speaker domain shift that arises when generating speech for a new speaker. To mitigate this problem, we propose adversarial speaker-consistency learning (ASCL). At each training iteration, the proposed method first generates additional speech of a query speaker using external untranscribed datasets. The model then learns to consistently generate a speech sample of the same speaker as the corresponding speaker embedding vector by employing an adversarial…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing