FastSpeech: Fast, Robust and Controllable Text to Speech

Yi Ren; Yangjun Ruan; Xu Tan; Tao Qin; Sheng Zhao; Zhou Zhao; Tie-Yan Liu

arXiv:1905.09263 · cs.CL · November 21, 2019 · 580 cites

TL;DR

FastSpeech introduces a parallel Transformer-based TTS model that significantly accelerates speech synthesis, improves robustness, and allows controllable voice speed, matching the quality of autoregressive models.

Contribution

The paper presents a novel feed-forward Transformer-based TTS model that enables fast, robust, and controllable speech synthesis with parallel generation.

Findings

01 Speeds up mel-spectrogram generation by 270x compared with an autoregressive Transformer TTS model

02 Reduces word skipping and repetition issues

03 Achieves speech quality comparable to autoregressive models

Abstract

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source…
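The length regulator mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each phoneme's hidden vector is simply repeated once per predicted mel-spectrogram frame, and that voice speed is controlled by scaling the durations with a factor `alpha` before rounding. The function name `length_regulate`, the round-half-up rule, and the toy values are assumptions made for this illustration; this is not the authors' implementation.

```python
import math
from typing import List, Sequence


def length_regulate(hidden: Sequence[Sequence[float]],
                    durations: Sequence[int],
                    alpha: float = 1.0) -> List[List[float]]:
    """Expand phoneme-level vectors to frame level (illustrative sketch).

    hidden    -- one feature vector per phoneme
    durations -- number of mel frames assigned to each phoneme
    alpha     -- hypothetical speed factor: >1 slows speech, <1 speeds it up
    """
    if len(hidden) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    expanded: List[List[float]] = []
    for vec, dur in zip(hidden, durations):
        # Scale the duration, then round half up to a whole number of frames.
        n_frames = max(0, math.floor(dur * alpha + 0.5))
        # Copy the vector once per frame so the frames do not share storage.
        expanded.extend(list(vec) for _ in range(n_frames))
    return expanded


# Toy input: four phonemes with one-dimensional "hidden states".
phoneme_states = [[1.0], [2.0], [3.0], [4.0]]
frame_counts = [2, 2, 3, 1]

normal = length_regulate(phoneme_states, frame_counts)
slow = length_regulate(phoneme_states, frame_counts, alpha=1.3)
fast = length_regulate(phoneme_states, frame_counts, alpha=0.5)
print(len(normal), len(slow), len(fast))
```

Because the output length depends only on the durations, all frames can be produced in one pass with no autoregressive loop, which is what makes parallel mel-spectrogram generation possible in this kind of design.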

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Mixture of Logistic Distributions · Sigmoid Activation · Convolution · Batch Normalization · Max Pooling · Tanh Activation