JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech
Dan Lim, Sunghee Jung, Eesung Kim

TL;DR
This paper introduces an end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module, simplifying the training process and improving synthesis quality without needing fine-tuning or external alignment tools.
Contribution
The novel joint training framework integrates FastSpeech2 and HiFi-GAN with an alignment learning objective, eliminating the need for separate training stages and external alignment tools.
Findings
Outperforms state-of-the-art TTS models on LJSpeech in subjective evaluation (MOS).
Simplifies training pipeline by removing fine-tuning and external alignment dependencies.
Achieves higher synthesis quality with end-to-end joint training.
Abstract
In neural text-to-speech (TTS), two-stage systems, or cascades of separately learned models, have shown synthesis quality close to human speech. For example, FastSpeech2 transforms an input text to a mel-spectrogram and then HiFi-GAN generates a raw waveform from the mel-spectrogram; they are called an acoustic feature generator and a neural vocoder, respectively. However, their training pipeline is somewhat cumbersome in that it requires fine-tuning and an accurate speech-text alignment for optimal performance. In this work, we present an end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models. Specifically, our proposed model is a jointly trained FastSpeech2 and HiFi-GAN with an alignment module. Since there is no acoustic feature mismatch between training and inference, it does not require fine-tuning.…
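The idea of joint training described in the abstract can be illustrated with a minimal sketch: the losses of the acoustic model, the vocoder, and the alignment module are combined into one scalar objective, so that a single optimization step updates all three modules. The names `LossWeights` and `joint_loss`, the three-term split, and the default weights of 1.0 are hypothetical and chosen only for illustration; the paper's actual loss terms and weights are not given in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights for illustration only; not taken from the paper.
    acoustic: float = 1.0   # acoustic feature generator (FastSpeech2 side)
    vocoder: float = 1.0    # neural vocoder (HiFi-GAN side)
    alignment: float = 1.0  # alignment learning objective


def joint_loss(acoustic_loss, vocoder_loss, alignment_loss, weights=None):
    """Return one weighted sum of the three per-module losses.

    In a cascade, each module would minimize its own loss in a separate
    training stage; here all terms share one objective.
    """
    w = weights if weights is not None else LossWeights()
    return (
        w.acoustic * acoustic_loss
        + w.vocoder * vocoder_loss
        + w.alignment * alignment_loss
    )
```

With plain floats standing in for the per-module losses, `joint_loss(1.0, 2.0, 3.0)` sums the three terms with equal weight. In a real training loop the inputs would be differentiable tensors, and the weights would be tuned hyperparameters.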
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: HiFi-GAN
