Parallel Tacotron: Non-Autoregressive and Controllable TTS
Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu

TL;DR
Parallel Tacotron is a non-autoregressive TTS model that improves efficiency and naturalness with a variational autoencoder-based residual encoder, lightweight convolutions, and an iterative spectrogram loss. It matches a strong autoregressive baseline in subjective quality while offering faster inference.
Contribution
The paper introduces a highly parallelizable non-autoregressive TTS model with a variational autoencoder-based residual encoder and an iterative spectrogram loss inspired by iterative refinement, improving both naturalness and efficiency.
Findings
Matches autoregressive baseline in subjective evaluations
Significantly reduces inference time
Uses a variational autoencoder to improve naturalness
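To make the variational autoencoder finding more concrete, here is a minimal sketch of the two generic ingredients any Gaussian VAE relies on: a reparameterized latent sample and a KL penalty toward a standard normal prior. This is the textbook formulation, not the paper's residual encoder, whose architecture is not described on this page. The function names `kl_to_standard_normal` and `sample_latent` are hypothetical and chosen only for illustration.

```python
import math
import random
from typing import List, Sequence


def kl_to_standard_normal(mean: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, log_var)
    )


def sample_latent(
    mean: Sequence[float], log_var: Sequence[float], rng: random.Random
) -> List[float]:
    """Reparameterized sample: z = mean + std * eps, with eps drawn from N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same sample
    mean, log_var = [0.5, -0.2], [0.0, -1.0]
    print("latent sample:", sample_latent(mean, log_var, rng))
    print("KL penalty:", kl_to_standard_normal(mean, log_var))
```

In a training loop, the KL term would typically be added to the reconstruction loss with a weighting factor. That weight, and how the latent is fed to the decoder, are modeling choices this sketch does not cover.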
Abstract
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to its efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve the naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations…
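The iterative spectrogram loss mentioned in the abstract can be illustrated with a small sketch. It assumes the decoder emits one spectrogram prediction per refinement step and that the loss is the plain average of per-step L1 distances to the target. Both the averaging scheme and the names `l1_distance` and `iterative_spectrogram_loss` are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Sequence

# A spectrogram is modeled as a list of frames, each a list of mel-bin values.
Spectrogram = Sequence[Sequence[float]]


def l1_distance(pred: Spectrogram, target: Spectrogram) -> float:
    """Mean absolute error over all frames and mel bins."""
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("spectrograms must not be empty")
    return total / count


def iterative_spectrogram_loss(
    predictions: Sequence[Spectrogram], target: Spectrogram
) -> float:
    """Average the L1 loss over every intermediate prediction (assumed scheme)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    return sum(l1_distance(p, target) for p in predictions) / len(predictions)


if __name__ == "__main__":
    target = [[1.0, 2.0], [3.0, 4.0]]
    # Three hypothetical refinement steps, each closer to the target.
    steps = [
        [[0.0, 0.0], [0.0, 0.0]],
        [[1.0, 2.0], [3.0, 2.0]],
        [[1.0, 2.0], [3.0, 4.0]],
    ]
    print(iterative_spectrogram_loss(steps, target))
```

Supervising every intermediate prediction, rather than only the final one, gives earlier decoder stages a direct training signal, which is the intuition behind borrowing from iterative refinement.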
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
