Neural Speech Synthesis with Transformer Network

Naihan Li; Shujie Liu; Yanqing Liu; Sheng Zhao; Ming Liu; Ming Zhou

arXiv:1809.08895·cs.CL·January 31, 2019·39 cites

TL;DR

This paper introduces a Transformer-based neural TTS model that significantly improves training efficiency and long-range dependency modeling, achieving state-of-the-art speech synthesis quality close to human levels.

Contribution

The paper adapts the Transformer's multi-head self-attention to neural TTS, replacing both the RNN structures and the original attention mechanism in Tacotron2. This speeds up training and improves modeling of long-range dependencies.

Findings

01

Training speed increased by 4.25 times compared to Tacotron2

02

Achieved state-of-the-art performance in human evaluation, outperforming Tacotron2 by a gap of 0.048

03

Generated speech quality close to human level (4.39 vs. 4.44 in MOS)

Abstract

Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) are proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) hard to model long dependency using current recurrent neural networks (RNNs). Inspired by the success of Transformer network in neural machine translation (NMT), in this paper, we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves the training efficiency. Meanwhile, any two inputs at different times are connected directly by self-attention mechanism, which solves the long range dependency problem effectively. Using phoneme sequences as input, our…
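The mechanism the abstract describes can be illustrated with a minimal sketch of multi-head scaled dot-product self-attention. This is a toy illustration in pure Python, not the authors' implementation: the helper names (`matmul`, `self_attention`, `multi_head_self_attention`, `random_matrix`), the random weights, and the tiny dimensions are all assumptions made for the example, and the random input stands in for an embedded phoneme sequence. No causal mask is applied, so it corresponds more closely to encoder-side self-attention.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """One head of scaled dot-product self-attention.

    Every position's output is computed from all positions at once,
    with no recurrence over time steps. Returns (output, weights).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # shape: seq_len x seq_len
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_self_attention(x, heads, w_o):
    """Run each head, concatenate along the feature axis, then project."""
    outs = [self_attention(x, *h)[0] for h in heads]
    concat = [[v for o in outs for v in o[t]] for t in range(len(x))]
    return matmul(concat, w_o)


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, n_heads = 5, 8, 2  # toy sizes, not the paper's settings
d_head = d_model // n_heads

x = random_matrix(seq_len, d_model, rng)  # stand-in for embedded phonemes
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
w_o = random_matrix(d_model, d_model, rng)

out = multi_head_self_attention(x, heads, w_o)
_, weights = self_attention(x, *heads[0])
```

In this sketch `weights` is a `seq_len` by `seq_len` matrix whose entries are all positive, so each position attends directly to every other position in a single step. That is the sense in which self-attention shortens the path between distant inputs compared with an RNN, which must pass information through every intermediate time step.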

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing

Methods: Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Mixture of Logistic Distributions · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing