Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech
Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Kao

TL;DR
This paper presents a novel alignment mechanism for autoregressive Transformer-based text-to-speech systems that significantly improves their robustness and ability to generalize to arbitrarily long utterances without errors.
Contribution
Introduces a relative location-based alignment mechanism for AR Transformer TTS that enhances length generalization and robustness without external alignment data.
Findings
System matches baseline naturalness and expressiveness.
Eliminates repeated or dropped words in long utterances.
Enables generalization to any practical utterance length.
Abstract
Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backpropagation and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved…
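The core idea in the abstract, giving cross-attention relative location information tied to an alignment position, can be illustrated with a small NumPy sketch. This is not the paper's formulation: the Gaussian-shaped bias, the `width` parameter, the function names, and the externally supplied `deltas` are all assumptions made for illustration. In the paper the alignment position is learned as a latent property of the model via backpropagation, whereas here the per-step advance is simply passed in.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()


def location_biased_cross_attention(query, keys, values, align_pos, width=2.0):
    """Single-head cross-attention with a bias centred on align_pos.

    query: (d,), keys: (T, d), values: (T, dv), align_pos: float index into the text.
    The quadratic bias is a hypothetical stand-in for relative location information.
    """
    d = query.shape[-1]
    content_logits = keys @ query / np.sqrt(d)           # standard scaled dot product
    rel = np.arange(keys.shape[0]) - align_pos           # position relative to alignment
    bias = -((rel / width) ** 2)                         # assumed shape, not from the paper
    weights = softmax(content_logits + bias)
    return weights @ values, weights


def decode(queries, keys, values, deltas):
    """Advance a monotonic alignment position and attend at each decoder step.

    deltas are given here; a real model would predict them from decoder state.
    """
    last = float(keys.shape[0] - 1)
    align_pos = 0.0
    contexts, positions = [], []
    for q, delta in zip(queries, deltas):
        # Clip so the position never moves backwards or past the last input token.
        align_pos = min(align_pos + max(float(delta), 0.0), last)
        ctx, _ = location_biased_cross_attention(q, keys, values, align_pos)
        contexts.append(ctx)
        positions.append(align_pos)
    return np.stack(contexts), positions
```

Because the bias depends only on the distance from the current alignment position, the same computation applies at any input length, which is the intuition behind length generalization in this toy setting.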
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Sigmoid Activation · Batch Normalization · Highway Layer · Tanh Activation · Bidirectional GRU · Convolution · Residual Connection · Max Pooling
