Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

Jonathan Shen; Ye Jia; Mike Chrzanowski; Yu Zhang; Isaac Elias; Heiga Zen; Yonghui Wu

arXiv:2010.04301 · cs.SD · May 12, 2021 · 73 cites

TL;DR

This paper introduces Non-Attentive Tacotron, a neural TTS model that replaces attention with explicit duration prediction, enhancing robustness and controllability, and enabling unsupervised duration modeling with minimal quality loss.

Contribution

The paper proposes a non-attentive TTS architecture with explicit duration modeling, which improves robustness and allows the duration predictor to be trained in a semi-supervised or unsupervised manner.

Findings

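01. Significantly more robust than Tacotron 2, as measured by unaligned duration ratio and word deletion rate.

02. Reaches a naturalness mean opinion score of 4.41 on a 5-point scale with Gaussian upsampling, slightly outperforming Tacotron 2.

03. Semi-supervised duration training with a fine-grained variational auto-encoder incurs minimal quality loss.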

Abstract

This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a 5-scale mean opinion score for naturalness of 4.41, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good…
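The abstract names Gaussian upsampling and duration control without spelling them out. The sketch below is one possible reading, not the authors' implementation. It assumes each token has a predicted duration (in frames) and a Gaussian width, places a Gaussian at the midpoint of each token's span, and forms every output frame as a normalized weighted sum of the token encodings. The function name `gaussian_upsample`, the `pace` multiplier for utterance-wide control, and the choice to evaluate each frame at its midpoint are illustrative assumptions.

```python
import math


def gaussian_upsample(encodings, durations, sigmas, pace=1.0):
    """Upsample per-token vectors to per-frame vectors with Gaussian weights.

    encodings: list of per-token feature vectors (lists of floats)
    durations: per-token durations in frames (floats, > 0)
    sigmas:    per-token Gaussian standard deviations (floats, > 0)
    pace:      hypothetical utterance-wide duration multiplier
    """
    # Utterance-wide control: scale every duration by the same factor.
    # Per-token control would instead edit individual entries of `durations`.
    scaled = [d * pace for d in durations]

    # Each token's Gaussian is centered at the midpoint of its time span.
    centers = []
    end = 0.0
    for d in scaled:
        end += d
        centers.append(end - d / 2.0)

    num_frames = max(1, round(end))
    dim = len(encodings[0])
    frames = []
    for t in range(num_frames):
        pos = t + 0.5  # assumption: evaluate each frame at its midpoint
        # Gaussian densities; the constant 1/sqrt(2*pi) cancels on normalization.
        # Assumes sigmas are not tiny relative to durations (no underflow to 0).
        dens = [
            math.exp(-0.5 * ((pos - c) / s) ** 2) / s
            for c, s in zip(centers, sigmas)
        ]
        total = sum(dens)
        weights = [x / total for x in dens]
        # Frame vector = weighted sum of token encodings.
        frames.append(
            [sum(w * h[k] for w, h in zip(weights, encodings)) for k in range(dim)]
        )
    return frames


# Two tokens with 1-dimensional encodings, lasting 2 and 3 frames.
frames = gaussian_upsample([[0.0], [1.0]], [2.0, 3.0], [1.0, 1.0])
slow = gaussian_upsample([[0.0], [1.0]], [2.0, 3.0], [1.0, 1.0], pace=2.0)
```

In this toy call `frames` has 5 entries and `slow` has 10, since doubling `pace` doubles the total duration. Because the weights vary smoothly with position, the frames blend gradually from the first token's encoding to the second's instead of switching abruptly as plain repetition would.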

Taxonomy

Topics: Neural Networks and Applications · Advanced Chemical Sensor Technologies · Blind Source Separation Techniques

Methods: Highway Layer · Max Pooling · Dilated Causal Convolution · Tanh Activation · Highway Network · Sigmoid Activation · Residual GRU · Residual Connection · Griffin-Lim Algorithm · Bidirectional GRU