FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

TL;DR
FastSpeech 2 is a non-autoregressive TTS model that trains directly on ground-truth targets instead of a teacher model's outputs and conditions on speech variation features, which simplifies training and improves voice quality over FastSpeech.
Contribution
The paper presents FastSpeech 2, a novel TTS model that simplifies training, improves voice quality, and enables end-to-end waveform generation, advancing non-autoregressive speech synthesis.
Findings
3x faster training than FastSpeech
Superior voice quality over FastSpeech
Faster inference with FastSpeech 2s
Abstract
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2,…
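The role of duration as extra input can be illustrated with a length regulator, the step FastSpeech-style models use to expand one hidden vector per phoneme into one vector per mel-spectrogram frame. The following is a minimal sketch in plain Python, assuming integer per-phoneme durations measured in frames; the function name `length_regulate` and the toy values are hypothetical and not taken from the authors' code.

```python
from typing import List, Sequence


def length_regulate(
    hidden: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme-level vector `durations[i]` times (hypothetical helper)."""
    if len(hidden) != len(durations):
        raise ValueError("need exactly one duration per phoneme vector")
    frames: List[List[float]] = []
    for vec, dur in zip(hidden, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector once per output frame so frames stay independent.
        frames.extend(list(vec) for _ in range(dur))
    return frames


# Toy example: three phonemes with 2-dimensional hidden states.
phoneme_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]  # assumed frame counts; a zero-length phoneme is skipped
frames = length_regulate(phoneme_states, durations)
```

The output length equals the sum of the durations, so an inaccurate duration directly misaligns the expanded sequence with the target mel-spectrogram. This is one way to see why duration accuracy matters for voice quality.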
Code & Models
- 🤗 facebook/fastspeech2-en-200_speaker-cv4 (model) · 7 dl · ♡ 6
- 🤗 facebook/fastspeech2-en-ljspeech (model) · 53 dl · ♡ 272
- 🤗 tensorspeech/tts-fastspeech2-baker-ch (model) · ♡ 6
- 🤗 tensorspeech/tts-fastspeech2-kss-ko (model) · ♡ 8
- 🤗 tensorspeech/tts-fastspeech2-ljspeech-en (model) · ♡ 1
- 🤗 ahnafsamin/FastSpeech2-gronings (model) · 1 dl
- 🤗 speechbrain/tts-fastspeech2-ljspeech (model) · 57 dl · ♡ 7
- 🤗 MarcNg/fastspeech2-vi-infore (model) · 2 dl · ♡ 6
- 🤗 infinisoft/tts (model) · ♡ 4
- 🤗 Bilgilice/bilgilice35 (model)
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Attention Is All You Need · Knowledge Distillation · Layer Normalization · Residual Connection · Batch Normalization · Softmax · Linear Layer · GAN Least Squares Loss · Multi-Head Attention
