Neural Speech Synthesis with Transformer Network
Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou

TL;DR
This paper introduces a Transformer-based neural TTS model that trains about 4.25 times faster than Tacotron2 and models long-range dependencies more directly, reaching state-of-the-art synthesis quality close to human level.
Contribution
The paper adapts the Transformer's multi-head attention to neural TTS, replacing both the RNN structures and the original attention mechanism in Tacotron2, which yields faster training and better modeling of long-range dependencies.
Findings
Training is about 4.25 times faster than with Tacotron2
Achieved state-of-the-art performance in human evaluation, outperforming Tacotron2 by a gap of 0.048
Generated speech quality is very close to human level (4.39 vs 4.44 in MOS)
Abstract
Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) are proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) hard to model long dependency using current recurrent neural networks (RNNs). Inspired by the success of Transformer network in neural machine translation (NMT), in this paper, we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves the training efficiency. Meanwhile, any two inputs at different times are connected directly by self-attention mechanism, which solves the long range dependency problem effectively. Using phoneme sequences as input, our…
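The parallelism and direct connectivity described in the abstract can be illustrated with a minimal sketch of multi-head self-attention. This is not the authors' implementation: the dimensions, the random weights, and the helper names (`matmul`, `self_attention_head`, `multi_head_self_attention`, `random_matrix`) are assumptions chosen for illustration, and the sketch omits positional encodings, decoder-side masking, residual connections, and layer normalization.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_head(x, wq, wk, wv):
    """Scaled dot-product self-attention for one head.

    x is T x d_model; wq, wk, wv are d_model x d_head (hypothetical weights).
    Returns the T x d_head output and the T x T attention weights.
    """
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_head = len(wq[0])
    k_t = [list(col) for col in zip(*k)]
    # scores[i][j] links position i to position j in a single step,
    # regardless of how far apart they are in the sequence.
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_self_attention(x, heads, wo):
    """Run every head on the whole sequence, concatenate, then project."""
    outputs = [self_attention_head(x, *h)[0] for h in heads]
    # Concatenate per-head outputs along the feature axis for each time step.
    concat = [sum((out[t] for out in outputs), []) for t in range(len(x))]
    return matmul(concat, wo)


def random_matrix(rows, cols, rng):
    """Illustrative random weights; a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, d_model, n_heads = 6, 8, 2  # assumed toy sizes, not the paper's settings
d_head = d_model // n_heads

x = random_matrix(T, d_model, rng)  # stand-in for embedded phoneme inputs
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
wo = random_matrix(n_heads * d_head, d_model, rng)

y = multi_head_self_attention(x, heads, wo)
_, weights = self_attention_head(x, *heads[0])
```

No step in this computation depends on the output of a previous time step, which is why all hidden states can be built in parallel during training. An RNN, by contrast, would need `T` sequential updates to carry information from the first position to the last.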
Code & Models
- 🤗 bayartsogt/tts_transformer-mn-mbspeech · model · 4 dl · ♡ 1
- 🤗 facebook/tts_transformer-ar-cv7 · model · 17 dl · ♡ 8
- 🤗 facebook/tts_transformer-en-200_speaker-cv4 · model · 23 dl · ♡ 2
- 🤗 facebook/tts_transformer-en-ljspeech · model · 5 dl · ♡ 6
- 🤗 facebook/tts_transformer-es-css10 · model · 8 dl · ♡ 30
- 🤗 facebook/tts_transformer-fr-cv7_css10 · model · 4 dl · ♡ 5
- 🤗 facebook/tts_transformer-ru-cv7_css10 · model · 9 dl · ♡ 13
- 🤗 facebook/tts_transformer-tr-cv7 · model · 10 dl · ♡ 13
- 🤗 facebook/tts_transformer-vi-cv7 · model · 5 dl · ♡ 11
- 🤗 facebook/tts_transformer-zh-cv7_css10 · model · 10 dl · ♡ 85
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
Methods: Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Mixture of Logistic Distributions · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing
