Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech
Yusuke Nakai, Yuki Saito, Kenta Udagawa, and Hiroshi Saruwatari

TL;DR
This paper introduces a multi-task adversarial training algorithm for multi-speaker neural TTS. It alternately trains a multi-task discriminator and the TTS model (the generator), which improves speech quality and generalization to unseen speakers.
Contribution
The proposed algorithm extends GAN-based training for multi-speaker TTS by adding a second discriminator task: verifying whether the speaker of the input speech is existent or non-existent. This improves synthesis quality and generalization to unseen speakers.
Findings
Outperforms conventional GAN-based TTS in speech quality
Enhances generalization to unseen speakers
Achieves high-quality multi-speaker synthesis
Abstract
We propose a novel training algorithm for a multi-speaker neural text-to-speech (TTS) model based on multi-task adversarial training. A conventional generative adversarial network (GAN)-based training algorithm significantly improves the quality of synthetic speech by reducing the statistical difference between natural and synthetic speech. However, the algorithm does not guarantee the generalization performance of the trained TTS model in synthesizing voices of unseen speakers who are not included in the training data. Our algorithm alternately trains two deep neural networks: a multi-task discriminator and a multi-speaker neural TTS model (i.e., the generator of the GAN). The discriminator is trained not only to distinguish between natural and synthetic speech but also to verify whether the speaker of the input speech is existent or non-existent (i.e., newly generated by interpolating seen speakers'…
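The two discriminator tasks and the alternating schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each task is scored with binary cross-entropy, that the generator follows the standard GAN objective of pushing both discriminator outputs toward "natural" and "existent", and that updates strictly alternate one-to-one. The abstract is cut off before it says what is interpolated, so the use of speaker embedding vectors with linear interpolation is also an assumption. All function names and the weight `w_spk` are hypothetical.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def discriminator_loss(p_natural, p_existent, is_natural, is_existent, w_spk=1.0):
    """Multi-task discriminator loss for one input (assumed form).

    p_natural:  output of the natural-vs-synthetic head.
    p_existent: output of the existent-vs-non-existent speaker head.
    w_spk:      hypothetical weight on the speaker task (not from the paper).
    """
    return bce(p_natural, is_natural) + w_spk * bce(p_existent, is_existent)


def generator_adv_loss(p_natural, p_existent, w_spk=1.0):
    """Adversarial loss for the generator (assumed standard GAN objective):
    make the discriminator judge synthetic speech as natural and existent."""
    return bce(p_natural, 1) + w_spk * bce(p_existent, 1)


def interpolate_speakers(emb_a, emb_b, alpha):
    """Assumption: a non-existent speaker is a linear interpolation of two
    seen speakers' embedding vectors, with alpha in [0, 1]."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]


def alternating_schedule(num_steps):
    """Assumed one-to-one alternation between the two networks."""
    return ["discriminator" if s % 2 == 0 else "generator" for s in range(num_steps)]
```

For example, `interpolate_speakers([0.0, 2.0], [2.0, 0.0], 0.5)` gives the midpoint of the two embeddings, and `generator_adv_loss` shrinks as the discriminator becomes more convinced that synthetic speech is natural and spoken by an existent speaker.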
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
