Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu

TL;DR
This paper introduces Tacotron 2, a neural speech synthesis system combining a sequence-to-sequence spectrogram predictor with a modified WaveNet vocoder, achieving high-quality natural speech comparable to professional recordings.
Contribution
The paper presents an end-to-end neural TTS system that conditions a modified WaveNet vocoder on predicted mel spectrograms instead of linguistic, duration, and F0 features, which allows a simpler WaveNet architecture while keeping synthesis quality high.
Findings
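Achieved a MOS of 4.53, close to the 4.58 measured for professionally recorded speech.
Ablation studies support conditioning WaveNet on mel spectrograms rather than on linguistic, duration, and F0 features.
A compact acoustic intermediate representation allows a significantly simpler WaveNet architecture.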
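A mean opinion score is an average of listener ratings, usually reported with a confidence interval. The sketch below shows one way such a number could be computed. The ratings are hypothetical and made up for illustration (they are not data from the paper), and the normal-approximation 95% interval is an assumption, not necessarily the procedure the authors used.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mean, half_width) for a list of listener ratings.

    half_width is a normal-approximation confidence half-width
    (z=1.96 corresponds to roughly 95%); this choice is an assumption.
    """
    n = len(ratings)
    mean = statistics.fmean(ratings)
    std = statistics.stdev(ratings)  # sample standard deviation
    half_width = z * std / math.sqrt(n)
    return mean, half_width


# Hypothetical ratings on a 1-5 scale, for illustration only.
example_ratings = [5, 4, 5, 4, 5, 5, 4, 5, 4, 5]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With real evaluations the list would hold many more ratings from many listeners, which is what makes a gap as small as 4.53 versus 4.58 meaningful to compare.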
Abstract
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
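To make "mel-scale spectrogram" concrete, the sketch below computes the center frequencies of a mel filterbank. The 80 channels and the 125 Hz to 7.6 kHz range follow the setup reported in the paper; the specific Hz-to-mel formula (the common HTK-style one) and the helper names are assumptions for illustration, not the authors' implementation.

```python
import math


def hz_to_mel(f_hz):
    """HTK-style Hz-to-mel conversion (assumed formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_mels=80, f_min=125.0, f_max=7600.0):
    """Center frequencies (Hz) of n_mels bands equally spaced on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_mels + 1)
    return [mel_to_hz(m_lo + step * (i + 1)) for i in range(n_mels)]


centers = mel_center_frequencies()
print(f"first band: {centers[0]:.1f} Hz, last band: {centers[-1]:.1f} Hz")
```

Bands are narrow at low frequencies and wide at high frequencies, so 80 values per frame keep the detail that matters most for intelligibility while staying far more compact than a linear-frequency spectrogram. That compactness is what the abstract refers to as a compact acoustic intermediate representation.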
Code & Models
- 🤗 dathudeptrai/tts-tacotron2-synpaflex-fr (model) · ♡ 1
- 🤗 tensorspeech/tts-tacotron2-baker-ch (model) · ♡ 7
- 🤗 tensorspeech/tts-tacotron2-kss-ko (model) · ♡ 5
- 🤗 tensorspeech/tts-tacotron2-ljspeech-en (model)
- 🤗 tensorspeech/tts-tacotron2-synpaflex-fr (model)
- 🤗 tensorspeech/tts-tacotron2-thorsten-ger (model)
- 🤗 speechbrain/tts-tacotron2-ljspeech (model) · 686 dl · ♡ 139
- 🤗 ahnafsamin/Tacotron2-gronings (model) · 3 dl
- 🤗 infinisoft/tts (model) · ♡ 4
- 🤗 Bilgilice/bilgilice35 (model)
Videos
Google's Text Reader AI: Almost Perfect | Two Minute Papers #228· youtube
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Zoneout · Long Short-Term Memory · Mixture of Logistic Distributions · Location Sensitive Attention · Bidirectional LSTM · Linear Layer · Exponential Decay · Weight Decay · Tacotron2 · Residual Connection
