One TTS Alignment To Rule Them All

Rohan Badlani; Adrian Łańcucki; Kevin J. Shih; Rafael Valle; Wei Ping; Bryan Catanzaro

arXiv:2108.10447 · cs.SD · August 25, 2021

TL;DR

This paper introduces a universal alignment learning framework for neural TTS models that enhances alignment robustness, convergence speed, and speech quality across various architectures, including autoregressive and non-autoregressive models.

Contribution

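It generalizes the alignment mechanism proposed in RAD-TTS, which combines the forward-sum algorithm, the Viterbi algorithm, and a static prior, into an alignment learning framework applicable to both autoregressive and non-autoregressive TTS models.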
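
As a rough illustration of how the forward-sum and Viterbi algorithms relate, the sketch below runs both over the same table of per-frame log-probabilities. It is not the authors' implementation. It assumes a hypothetical input `log_probs[t][n]` (log-probability that mel frame `t` belongs to text token `n`) and assumes alignments are strictly monotonic: they start at the first token, end at the last, and either stay or advance by one token per frame. All function names are illustrative.

```python
import math

NEG_INF = float("-inf")


def _logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)) for plain floats."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def forward_sum(log_probs):
    """Log of the total probability of all monotonic alignments."""
    num_frames, num_tokens = len(log_probs), len(log_probs[0])
    alpha = [NEG_INF] * num_tokens
    alpha[0] = log_probs[0][0]  # every alignment starts at token 0
    for t in range(1, num_frames):
        prev = alpha
        alpha = [NEG_INF] * num_tokens
        for n in range(num_tokens):
            stay = prev[n]
            advance = prev[n - 1] if n > 0 else NEG_INF
            total = _logaddexp(stay, advance)
            if total != NEG_INF:
                alpha[n] = total + log_probs[t][n]
    return alpha[num_tokens - 1]  # every alignment ends at the last token


def viterbi_path(log_probs):
    """Single most likely monotonic alignment, as one token index per frame."""
    num_frames, num_tokens = len(log_probs), len(log_probs[0])
    score = [[NEG_INF] * num_tokens for _ in range(num_frames)]
    advanced = [[False] * num_tokens for _ in range(num_frames)]
    score[0][0] = log_probs[0][0]
    for t in range(1, num_frames):
        for n in range(num_tokens):
            stay = score[t - 1][n]
            advance = score[t - 1][n - 1] if n > 0 else NEG_INF
            if advance > stay:
                score[t][n] = advance + log_probs[t][n]
                advanced[t][n] = True
            else:
                score[t][n] = stay + log_probs[t][n]
    # Backtrace from the last token at the last frame.
    path = [0] * num_frames
    n = num_tokens - 1
    for t in range(num_frames - 1, -1, -1):
        path[t] = n
        if advanced[t][n]:
            n -= 1
    return path


def durations_from_path(path, num_tokens):
    """Count how many frames the hard alignment assigns to each token."""
    counts = [0] * num_tokens
    for n in path:
        counts[n] += 1
    return counts
```

In this sketch the forward-sum value is a differentiable-style training signal (the negative of it could serve as a loss), while the Viterbi path gives a hard alignment from which per-token durations can be read off.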

Findings

01

Improved alignment convergence speed.

02

Greater robustness to errors on long utterances, such as missing or repeated words.

03

Better perceived speech quality by human evaluators.

Abstract

Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive end-to-end TTS models rely on durations extracted from external sources. In this paper we leverage the alignment mechanism proposed in RAD-TTS as a generic alignment learning framework, easily applicable to a variety of neural TTS models. The framework combines the forward-sum algorithm, the Viterbi algorithm, and a simple and efficient static prior. In our experiments, the alignment learning framework improves all tested TTS architectures, both autoregressive (Flowtron, Tacotron 2) and non-autoregressive (FastPitch, FastSpeech 2,…
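
The abstract does not spell out the form of the static prior. As a hedged illustration only, the sketch below assumes a beta-binomial prior over text tokens for each mel frame, parameterized so that probability mass concentrates near the diagonal of the frame-by-token grid. The parameterization, the `scale` argument, and the function names are assumptions made for this example.

```python
import math


def _log_beta(x, y):
    """Log of the Beta function, via log-gamma."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def beta_binomial_log_pmf(k, n, a, b):
    """Log-probability of k successes in n trials under BetaBinomial(n, a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + _log_beta(k + a, n - k + b) - _log_beta(a, b)


def static_log_prior(num_frames, num_tokens, scale=1.0):
    """Return a num_frames x num_tokens table of log prior probabilities.

    Assumed parameterization: for frame t (1-based), a = scale * t and
    b = scale * (num_frames - t + 1), so early frames favour early tokens
    and late frames favour late tokens. Each row sums to 1.
    """
    rows = []
    for t in range(1, num_frames + 1):
        a = scale * t
        b = scale * (num_frames - t + 1)
        rows.append(
            [beta_binomial_log_pmf(k, num_tokens - 1, a, b) for k in range(num_tokens)]
        )
    return rows
```

Because the table depends only on the number of frames and tokens, it can be computed once per utterance length pair and cached. A prior of this kind would typically be added, in log space, to a model's alignment scores to discourage far-off-diagonal alignments early in training.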

Code & Models

Videos

NVIDIA’s Amazing AI Clones Your Voice! 🤐 · YouTube

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems

Methods: Attention Is All You Need · Layer Normalization · Softmax · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · FastSpeech 2 · Sigmoid Activation · Highway Layer · Residual Connection