Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
Jonathan Shen, Ye Jia, Mike Chrzanowski, Yu Zhang, Isaac Elias, Heiga Zen, Yonghui Wu

TL;DR
Non-Attentive Tacotron is a neural TTS model that replaces Tacotron 2's attention mechanism with an explicit duration predictor. This improves robustness, gives direct control over duration at inference time, and allows semi-supervised or unsupervised duration modeling with minimal quality loss.
Contribution
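The paper proposes a non-attentive TTS architecture with explicit duration modeling, introduces two metrics for large-scale robustness evaluation (unaligned duration ratio and word deletion rate), and describes a way to train the duration predictor in a semi-supervised or unsupervised manner.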
Findings
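Significantly more robust than Tacotron 2, as measured by unaligned duration ratio and word deletion rate.
Reaches a naturalness mean opinion score of 4.41 on a 5-point scale with Gaussian upsampling, slightly outperforming Tacotron 2.
Semi-supervised or unsupervised duration training with a fine-grained variational auto-encoder incurs minimal quality loss.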
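To make the word deletion rate metric concrete, here is a minimal sketch. It assumes the rate is the fraction of input words that a minimum-edit-distance alignment marks as deleted in the ASR transcript of the synthesized audio. The function name `word_deletion_rate`, the lowercasing and whitespace tokenization, and the alignment tie-breaking are illustrative assumptions, not the paper's exact procedure, and the ASR step itself is not shown.

```python
def word_deletion_rate(reference, hypothesis):
    """Fraction of reference words deleted in the hypothesis.

    reference:  the text given to the TTS model.
    hypothesis: an ASR transcript of the synthesized audio (assumed given).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    n, m = len(ref), len(hyp)

    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back along one optimal alignment and count deletions.
    i, j, deletions = n, m, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            i, j = i - 1, j - 1      # match or substitution
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            deletions += 1           # reference word missing from the transcript
            i -= 1
        else:
            j -= 1                   # extra word in the transcript (insertion)
    return deletions / n if n else 0.0
```

For example, `word_deletion_rate("the quick brown fox jumps", "the quick fox jumps")` counts one deleted word out of five. Averaging this over a large test set gives a single robustness number that needs no human listening.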
Abstract
This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a 5-scale mean opinion score for naturalness of 4.41, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good…
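The following is a minimal sketch of how Gaussian upsampling and duration control could fit together. It assumes each input token has an encoding vector, a duration in frames, and a range parameter (standard deviation), and that each output frame is a weighted average of all token encodings, with weights given by normalized Gaussian densities centered on each token's span. The helper names `scale_durations` and `gaussian_upsample`, the `stretch` and `per_token` parameters, and the rounding of the total frame count are hypothetical choices made for illustration. The sketch uses plain Python lists rather than a tensor library.

```python
import math


def scale_durations(durations, stretch=1.0, per_token=None):
    """Hypothetical helper: lengthen or shorten durations.

    stretch scales the whole utterance; per_token scales individual tokens.
    """
    if per_token is None:
        per_token = [1.0] * len(durations)
    if len(per_token) != len(durations):
        raise ValueError("per_token must have one factor per duration")
    return [d * stretch * f for d, f in zip(durations, per_token)]


def gaussian_upsample(encodings, durations, sigmas):
    """Expand N token encodings into T frame vectors.

    encodings: N vectors (lists of floats), one per token.
    durations: N positive durations in frames.
    sigmas:    N positive Gaussian standard deviations.
    """
    if not (len(encodings) == len(durations) == len(sigmas)):
        raise ValueError("encodings, durations, and sigmas must match in length")

    # Center of each token: end of its span minus half its duration.
    centers, end = [], 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)

    num_frames = int(round(end))  # assumption: total length is the rounded sum
    dim = len(encodings[0]) if encodings else 0
    frames = []
    for t in range(1, num_frames + 1):
        # Log Gaussian density of frame t under each token (shared constant dropped).
        logits = [-0.5 * ((t - c) / s) ** 2 - math.log(s)
                  for c, s in zip(centers, sigmas)]
        # Normalize in a numerically stable way so the weights sum to 1.
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        frames.append([sum(w * h[k] for w, h in zip(weights, encodings))
                       for k in range(dim)])
    return frames
```

Calling `gaussian_upsample(enc, scale_durations(d, stretch=2.0), sigmas)` doubles the number of output frames for the whole utterance, while passing `per_token` factors changes only selected tokens. Because every frame mixes neighboring tokens smoothly instead of repeating one encoding, the operation is differentiable with respect to both durations and range parameters.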
Taxonomy
Topics: Neural Networks and Applications · Advanced Chemical Sensor Technologies · Blind Source Separation Techniques
Methods: Highway Layer · Max Pooling · Dilated Causal Convolution · Tanh Activation · Highway Network · Sigmoid Activation · Residual GRU · Residual Connection · Griffin-Lim Algorithm · Bidirectional GRU
