Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech

Yusuke Nakai, Yuki Saito, Kenta Udagawa, and Hiroshi Saruwatari

arXiv:2209.12549·cs.SD·September 27, 2022

Open Access

TL;DR

This paper introduces a multi-task adversarial training algorithm for multi-speaker neural TTS. It improves speech quality and generalization to unseen speakers by alternately training a multi-task discriminator and the TTS model (the generator).

Contribution

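The proposed algorithm extends GAN-based training for multi-speaker TTS by adding a speaker verification task to the discriminator, which judges whether the speaker of the input speech is existent or non-existent in addition to distinguishing natural from synthetic speech. This improves synthesis quality and generalization to unseen speakers.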

Findings

01 Outperforms conventional GAN-based TTS in speech quality

02 Enhances generalization to unseen speakers

03 Achieves high-quality multi-speaker synthesis

Abstract

We propose a novel training algorithm for a multi-speaker neural text-to-speech (TTS) model based on multi-task adversarial training. A conventional generative adversarial network (GAN)-based training algorithm significantly improves the quality of synthetic speech by reducing the statistical difference between natural and synthetic speech. However, the algorithm does not guarantee the generalization performance of the trained TTS model in synthesizing voices of unseen speakers who are not included in the training data. Our algorithm alternately trains two deep neural networks: a multi-task discriminator and a multi-speaker neural TTS model (i.e., the generator of the GAN). The discriminator is trained not only to distinguish between natural and synthetic speech but also to verify whether the speaker of the input speech is existent or non-existent (i.e., newly generated by interpolating seen speakers'…
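The two discriminator tasks described in the abstract can be sketched as a sum of two binary classification losses. This is a minimal illustration under stated assumptions, not the authors' implementation: the use of binary cross-entropy, the `task_weight` balancing factor, the generator objective, and all function names are hypothetical, since the abstract does not give the exact loss formulation.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def multitask_discriminator_loss(p_natural, p_existent,
                                 is_natural, is_existent, task_weight=1.0):
    """Assumed discriminator objective: two heads, one loss term per task.

    p_natural  : predicted probability that the input speech is natural
    p_existent : predicted probability that the speaker is existent
    task_weight: hypothetical factor balancing the two tasks
    """
    return bce(p_natural, is_natural) + task_weight * bce(p_existent, is_existent)

def generator_adversarial_loss(p_natural, p_existent, task_weight=1.0):
    """Assumed generator objective: make both heads answer 'natural' and 'existent'."""
    return bce(p_natural, 1) + task_weight * bce(p_existent, 1)

# Label pairs (is_natural, is_existent) for the discriminator step:
#   natural speech of a seen speaker        -> (1, 1)
#   synthetic speech of a seen speaker      -> (0, 1)
#   synthetic speech of an interpolated,
#   non-existent speaker                    -> (0, 0)
d_loss = multitask_discriminator_loss(0.8, 0.3, is_natural=0, is_existent=0)
g_loss = generator_adversarial_loss(0.8, 0.3)
```

In an alternating scheme, the discriminator would be updated with `multitask_discriminator_loss` while the generator is held fixed, and the generator would then be updated with `generator_adversarial_loss` (typically alongside a reconstruction loss) while the discriminator is held fixed.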

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing