Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech

Eric Battenberg; RJ Skerry-Ryan; Daisy Stanton; Soroosh Mariooryad; Matt Shannon; Julian Salazar; David Kao

arXiv:2410.22179·cs.CL·March 13, 2025

TL;DR

This paper presents a novel alignment mechanism for autoregressive Transformer-based text-to-speech systems that significantly improves their robustness and ability to generalize to arbitrarily long utterances without errors.

Contribution

Introduces a relative location-based alignment mechanism for AR Transformer TTS that enhances length generalization and robustness without external alignment data.

Findings

01

System matches baseline naturalness and expressiveness.

02

Eliminates repeated or dropped words in long utterances.

03

Enables generalization to any practical utterance length.

Abstract

Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backpropagation and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved…
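The idea of giving cross-attention relative location information can be illustrated with a small sketch. This is not the paper's formulation: it assumes a simple Gaussian-shaped bias centered on a scalar alignment position, and the names `location_biased_attention`, `advance`, and `width` are hypothetical. In the paper the alignment position is learned via backpropagation; here the per-step deltas are supplied by hand purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def location_biased_attention(content_scores, align_pos, width=2.0):
    """Attention weights over encoder positions with a relative-location bias.

    Assumption: the bias is a Gaussian-shaped penalty on the signed distance
    between each encoder index and the current alignment position.
    """
    biased = []
    for j, score in enumerate(content_scores):
        rel = j - align_pos  # relative location; independent of sequence length
        biased.append(score - (rel * rel) / (2.0 * width * width))
    return softmax(biased)


def advance(align_pos, delta):
    """Monotonic update: the alignment position never moves backwards."""
    return align_pos + max(0.0, delta)


# Hand-picked deltas stand in for what a model would predict at each step.
content_scores = [0.0] * 50  # uniform content scores, so only the bias matters
pos = 0.0
for step_delta in [0.5, 1.0, -0.3, 2.5]:
    pos = advance(pos, step_delta)

weights = location_biased_attention(content_scores, pos)
peak = max(range(len(weights)), key=weights.__getitem__)
print(pos, peak)
```

Because the bias depends only on the distance `j - align_pos`, the same function applies unchanged to an input of any length, which is the intuition behind using relative rather than absolute positions for length generalization.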

Code & Models

Repositories

google/sequence-layers/blob/main/examples/very_attentive_tacotron.py
tf · Official

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Sigmoid Activation · Batch Normalization · Highway Layer · Tanh Activation · Bidirectional GRU · Convolution · Residual Connection · Max Pooling