Multi-modal Adversarial Training for Zero-Shot Voice Cloning

John Janiczek; Dading Chong; Dongyang Dai; Arlo Faria; Chao Wang; Tao Wang; Yuzong Liu

arXiv:2408.15916·eess.AS·August 29, 2024

TL;DR

This paper introduces a novel adversarial training method using a Transformer-based discriminator to enhance zero-shot voice cloning in TTS systems, significantly improving speech naturalness and speaker similarity.

Contribution

It proposes a new adversarial training approach with a Transformer discriminator for zero-shot voice cloning, improving upon existing GAN-based methods.

Findings

01. Enhanced speech quality and naturalness
02. Improved speaker similarity in zero-shot cloning
03. Effective training on large multi-speaker datasets

Abstract

A text-to-speech (TTS) model trained to reconstruct speech given text tends towards predictions that are close to the average characteristics of a dataset, failing to model the variations that make human speech sound natural. This problem is magnified for zero-shot voice cloning, a task that requires training data with high variance in speaking styles. We build on recent work that uses Generative Adversarial Networks (GANs) by proposing a Transformer encoder-decoder architecture that conditionally discriminates between real and generated speech features. The discriminator is used in a training pipeline that improves both the acoustic and prosodic features of a TTS model. We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning. Our model…
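
The abstract describes pairing a reconstruction-trained TTS model with a conditional discriminator. As a rough illustration of how such a pairing is commonly set up, the sketch below computes generator and discriminator losses from discriminator scores. The least-squares GAN objective, the function names, and the `adv_weight` parameter are all assumptions made for illustration; the text above does not state which loss the authors use, and the scores here merely stand in for the outputs of a conditional discriminator.

```python
# Hypothetical sketch, not the paper's implementation.
# Scores stand in for D(speech_features | conditioning) outputs.
# The least-squares GAN objective is an assumption chosen for illustration.

def discriminator_loss(real_scores, fake_scores):
    """Push scores on real features toward 1 and on generated features toward 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_adversarial_loss(fake_scores):
    """Push the discriminator's scores on generated features toward 1."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


def total_generator_loss(reconstruction_loss, fake_scores, adv_weight=1.0):
    """Combine a reconstruction loss with a weighted adversarial term."""
    return reconstruction_loss + adv_weight * generator_adversarial_loss(fake_scores)
```

The adversarial term gives the generator a training signal beyond matching the dataset average, which is the failure mode the abstract attributes to reconstruction-only training. How the two terms are actually weighted in the paper is not given here, so `adv_weight` is only a placeholder.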

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings