JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

Dan Lim; Sunghee Jung; Eesung Kim

arXiv:2203.16852·eess.AS·July 5, 2022·1 cites

TL;DR

This paper introduces an end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module, simplifying the training process and improving synthesis quality without needing fine-tuning or external alignment tools.

Contribution

The novel joint training framework integrates FastSpeech2 and HiFi-GAN with an alignment learning objective, eliminating the need for separate training stages and external alignment tools.

Findings

1. Outperforms state-of-the-art TTS models on LJSpeech in subjective MOS scores.

2. Simplifies the training pipeline by removing fine-tuning and external alignment dependencies.

3. Achieves higher synthesis quality than a cascade of separately trained models through end-to-end joint training.

Abstract

In neural text-to-speech (TTS), two-stage systems, or cascades of separately learned models, have shown synthesis quality close to human speech. For example, FastSpeech2 transforms an input text into a mel-spectrogram, and HiFi-GAN then generates a raw waveform from that mel-spectrogram; the two are called an acoustic feature generator and a neural vocoder, respectively. However, their training pipeline is somewhat cumbersome in that it requires fine-tuning and an accurate speech-text alignment for optimal performance. In this work, we present an end-to-end text-to-speech (E2E-TTS) model that has a simplified training pipeline and outperforms a cascade of separately learned models. Specifically, our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module. Since there is no acoustic feature mismatch between training and inference, it does not require fine-tuning.…
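
The train/inference mismatch that the abstract refers to can be illustrated with a toy sketch. This is not the authors' implementation: every function name below is hypothetical, the "features" are plain lists of floats rather than real mel-spectrograms, and the fixed 0.05 prediction error is an arbitrary assumption. It also assumes the usual cascade setup in which the vocoder is trained on features extracted from recorded speech. The sketch only shows which features the vocoder consumes in each phase under the two pipelines.

```python
from typing import List, Tuple

Features = List[float]


def ground_truth_features(text: str) -> Features:
    """Toy stand-in for acoustic frames extracted from recorded speech."""
    return [ord(ch) / 128.0 for ch in text]


def predicted_features(text: str) -> Features:
    """Toy stand-in for acoustic-model output: ground truth plus an assumed fixed error."""
    return [value + 0.05 for value in ground_truth_features(text)]


def cascade_vocoder_inputs(text: str) -> Tuple[Features, Features]:
    """Separately trained vocoder: trained on ground truth, fed predictions at inference."""
    return ground_truth_features(text), predicted_features(text)


def joint_vocoder_inputs(text: str) -> Tuple[Features, Features]:
    """Jointly trained vocoder: consumes the acoustic model's output in both phases."""
    features = predicted_features(text)
    return features, features


def max_mismatch(train: Features, infer: Features) -> float:
    """Largest per-frame gap between what the vocoder saw in training and at inference."""
    return max(abs(a - b) for a, b in zip(train, infer))


if __name__ == "__main__":
    print("cascade mismatch:", max_mismatch(*cascade_vocoder_inputs("hello")))
    print("joint mismatch:  ", max_mismatch(*joint_vocoder_inputs("hello")))
```

In the cascade case the gap is nonzero, which is the kind of discrepancy a fine-tuning stage would be expected to compensate for; in the joint case the vocoder's training and inference inputs come from the same source, so the gap is zero by construction.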

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: HiFi-GAN