Non-Autoregressive Neural Text-to-Speech

Kainan Peng; Wei Ping; Zhao Song; Kexin Zhao

arXiv:1905.08459 · cs.CL · July 1, 2020 · 26 cites

TL;DR

This paper introduces ParaNet, a fast, fully convolutional non-autoregressive text-to-speech model that significantly accelerates synthesis while maintaining good speech quality, and explores parallel vocoders including a VAE-based IAF approach.

Contribution

The paper presents ParaNet, a novel non-autoregressive TTS model with improved speed and stable alignment, and introduces a VAE-based training method for parallel vocoders.

Findings

01

Achieves a 46.7x synthesis speed-up over the lightweight Deep Voice 3.

02

Produces stable text-speech alignment on challenging test sentences by iteratively refining the attention layer by layer.

03

Builds a parallel text-to-speech system that synthesizes speech from text in a single feed-forward pass, and tests several parallel neural vocoders.

Abstract

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on the challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build the parallel text-to-speech system and test various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet as in previous work.
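
The contrast between autoregressive and non-autoregressive decoding that the abstract relies on can be illustrated with a toy sketch. This is not ParaNet or Deep Voice 3: the function names, the character "encoder", and the fixed uniform alignment are all assumptions made for illustration, standing in for learned convolutional layers and attention. It only shows why one decoder must run frame by frame while the other can produce every frame independently.

```python
from typing import List


def encode(text: str) -> List[float]:
    """Toy 'encoder' (hypothetical): map each character to a number in [0, 1)."""
    return [(ord(ch) % 32) / 32.0 for ch in text]


def toy_alignment(t: int, n_frames: int, n_chars: int) -> int:
    """Fixed uniform, monotonic alignment; a stand-in for learned attention."""
    return t * n_chars // n_frames


def autoregressive_decode(enc: List[float], n_frames: int) -> List[float]:
    """Frame t depends on frame t-1, so the loop cannot be parallelized."""
    frames = []
    prev = 0.0
    for t in range(n_frames):
        ctx = enc[toy_alignment(t, n_frames, len(enc))]
        prev = 0.5 * prev + 0.5 * ctx  # uses the previously generated frame
        frames.append(prev)
    return frames


def non_autoregressive_decode(enc: List[float], n_frames: int) -> List[float]:
    """Frame t depends only on the encoding and on t, so all frames are
    independent and could be computed in a single parallel pass."""
    return [enc[toy_alignment(t, n_frames, len(enc))] for t in range(n_frames)]


if __name__ == "__main__":
    enc = encode("hi")
    print(autoregressive_decode(enc, 4))
    print(non_autoregressive_decode(enc, 4))
```

In the non-autoregressive function, no frame reads another frame's output, which is the property that allows synthesis in one feed-forward pass; the sketch says nothing about the speech quality or the actual speed-up reported in the paper.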

Code & Models

Videos

Non-Autoregressive Neural Text-to-Speech · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Neural Networks and Applications

Methods: Attention Is All You Need · Bridge-net · Normalizing Flows · WaveNet · ClariNet · WaveVAE · Weight Normalization · Softmax · L1 Regularization