GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis
Jinhyeok Yang, Jae-Sung Bae, Taejun Bak, Youngik Kim, Hoon-Young Cho

TL;DR
GANSpeech applies adversarial training to a non-autoregressive multi-speaker TTS model. In subjective listening tests it outperforms the baseline multi-speaker FastSpeech and FastSpeech2 models, producing high-fidelity speech with less dependence on the target speaker.
Contribution
The paper proposes GANSpeech, a novel adversarially trained multi-speaker TTS model with automatic feature matching loss scaling, enhancing speech quality over baseline models.
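As a rough illustration of what scaling a feature matching loss automatically can look like, the sketch below computes a feature matching loss as the mean absolute difference between discriminator feature maps for real and generated speech, then weights it so that its contribution matches the reconstruction loss. The scaling rule used here (weight = reconstruction loss / feature matching loss) is an assumption made for illustration, since this summary does not spell out the paper's exact formula. The function names, the toy feature shapes, and the loss values are all hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def feature_matching_loss(real_feats, fake_feats):
    """Mean absolute difference between paired feature maps, averaged over layers.

    real_feats / fake_feats: lists of arrays standing in for the intermediate
    discriminator activations on real and generated speech (hypothetical shapes).
    """
    per_layer = [np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats)]
    return float(np.mean(per_layer))


def auto_scaled_generator_loss(adv_loss, recon_loss, fm_loss, eps=1e-8):
    """Combine generator losses with an automatically chosen feature matching weight.

    ASSUMPTION: the weight is recon_loss / fm_loss, so the scaled feature
    matching term has roughly the same magnitude as the reconstruction term.
    In an autodiff framework this weight would be treated as a constant
    (no gradient flowing through it).
    """
    scale = recon_loss / (fm_loss + eps)
    total = adv_loss + recon_loss + scale * fm_loss
    return total, scale


# Toy data: two "layers" of discriminator features; the fake features are
# offset from the real ones by a constant 0.1.
rng = np.random.default_rng(0)
real_feats = [rng.normal(size=(4, 8)), rng.normal(size=(4, 16))]
fake_feats = [f + 0.1 for f in real_feats]

fm = feature_matching_loss(real_feats, fake_feats)
total, scale = auto_scaled_generator_loss(adv_loss=0.5, recon_loss=0.2, fm_loss=fm)
print(f"fm={fm:.4f} scale={scale:.4f} total={total:.4f}")
```

The point of a rule like this is to avoid hand-tuning a fixed weight: whatever the raw magnitude of the feature matching term, the scaled term tracks the reconstruction loss during training.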
Findings
GANSpeech outperforms baseline FastSpeech and FastSpeech2 in subjective tests.
GANSpeech achieves a higher MOS than speaker-specific fine-tuned FastSpeech2.
Adversarial training improves multi-speaker TTS fidelity.
Abstract
Recent advances in neural multi-speaker text-to-speech (TTS) models have enabled the generation of reasonably good speech quality with a single model and made it possible to synthesize the speech of a speaker with limited training data. Fine-tuning to the target speaker data with the multi-speaker model can achieve better quality; however, there still exists a gap compared to the real speech sample, and the model depends on the speaker. In this work, we propose GANSpeech, which is a high-fidelity multi-speaker TTS model that applies the adversarial training method to a non-autoregressive multi-speaker TTS model. In addition, we propose simple but efficient automatic scaling methods for feature matching loss used in adversarial training. In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models, and showed a better…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
