A New GAN-based End-to-End TTS Training Algorithm
Haohan Guo, Frank K. Soong, Lei He, Lei Xie

TL;DR
This paper introduces a GAN-based training algorithm for end-to-end TTS models that reduces exposure bias and improves alignment stability and perceptual quality over standard transfer learning.
Contribution
It proposes a novel GAN-based training approach incorporating real and generated data, inspired by Professor Forcing, to enhance TTS model training.
Findings
Preferred over standard transfer learning in an AB subjective listening test, with a CMOS improvement of 0.1
Significant improvement in sentence-level intelligibility on pathological data
More stable alignment production for Tacotron output
Abstract
End-to-end, autoregressive model-based TTS has shown significant performance improvements over conventional TTS. However, training of the autoregressive module is affected by exposure bias, i.e., the mismatch between the distributions of real and predicted data: real data is available in training, but in testing only predicted data is available to feed the autoregressive module. By introducing both real and generated data sequences in training, we can alleviate the effects of exposure bias. We propose to use a Generative Adversarial Network (GAN) along with the key idea of Professor Forcing in training. A discriminator in the GAN is jointly trained to equalize the difference between real and predicted data. In an AB subjective listening test, the new approach is preferred over standard transfer learning, with a CMOS improvement of 0.1. Sentence…
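The exposure-bias mismatch described in the abstract can be illustrated with a toy sketch. This is not the paper's Tacotron setup: the scalar predictor `step`, the mean-based `discriminator`, and all parameter values are hypothetical stand-ins chosen only to show the generic Professor Forcing idea, in which the same model is run once conditioned on real frames (teacher forcing) and once on its own outputs (free running), and a discriminator is trained to tell the two apart while the generator is trained to make them indistinguishable.

```python
import math

def step(prev, w=0.9, b=0.1):
    """Hypothetical one-step autoregressive predictor (assumption, not the paper's model)."""
    return w * prev + b

def teacher_forced(real):
    # Training-time mode: each prediction is conditioned on the REAL previous frame.
    return [step(real[t - 1]) for t in range(1, len(real))]

def free_running(first, n):
    # Test-time mode: each prediction is conditioned on the model's OWN previous output.
    out, prev = [], first
    for _ in range(n):
        prev = step(prev)
        out.append(prev)
    return out

def discriminator(seq, v=1.0, c=0.0):
    """Hypothetical discriminator: logistic score of the sequence mean."""
    mean = sum(seq) / len(seq)
    return 1.0 / (1.0 + math.exp(-(v * mean + c)))

def gan_losses(real, eps=1e-8):
    tf = teacher_forced(real)
    fr = free_running(real[0], len(real) - 1)
    d_tf, d_fr = discriminator(tf), discriminator(fr)
    # Discriminator objective: score teacher-forced output as 1, free-running as 0.
    d_loss = -(math.log(d_tf + eps) + math.log(1.0 - d_fr + eps))
    # Generator objective: make free-running output look teacher-forced.
    g_loss = -math.log(d_fr + eps)
    return d_loss, g_loss, tf, fr

if __name__ == "__main__":
    real = [0.0, 0.5, 1.0, 1.5]  # made-up "real" frames for illustration
    d_loss, g_loss, tf, fr = gan_losses(real)
    print("teacher-forced:", tf)
    print("free-running:  ", fr)
    print("d_loss:", d_loss, "g_loss:", g_loss)
```

In this toy run the two modes agree on the first predicted frame and then drift apart, which is the gap the adversarial term is meant to shrink. In a real system both losses would be minimized by gradient steps, alternating between discriminator and generator, alongside the usual reconstruction loss.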
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Griffin-Lim Algorithm · Sigmoid Activation · Highway Layer · Residual Connection · Batch Normalization · Max Pooling · Residual GRU · Bidirectional GRU · Highway Network · CBHG
