One TTS Alignment To Rule Them All
Rohan Badlani, Adrian Łańcucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro

TL;DR
This paper introduces a universal alignment learning framework for neural TTS models that enhances alignment robustness, convergence speed, and speech quality across various architectures, including autoregressive and non-autoregressive models.
Contribution
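It adapts the RAD-TTS alignment mechanism, which combines the forward-sum algorithm, the Viterbi algorithm, and a static prior, into a generic alignment learning framework that improves the performance and robustness of a range of TTS models.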
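To make the forward-sum part concrete, here is a minimal NumPy sketch of a forward-sum log-likelihood over monotonic alignments. It is an illustration under stated assumptions, not the authors' implementation: the function name `forward_sum_log_likelihood`, the `(frames, tokens)` matrix layout, and the rule that every alignment starts on the first token, ends on the last, and either stays or advances by one token per frame are all choices made for this example.

```python
import numpy as np


def forward_sum_log_likelihood(log_probs):
    """Log of the summed probability of all monotonic alignments.

    log_probs[t, n] is assumed to hold log P(token n | frame t).
    Hypothetical helper for illustration only.
    """
    num_frames, num_tokens = log_probs.shape
    if num_frames < num_tokens:
        raise ValueError("need at least one frame per token")

    # alpha[t, n]: log-probability mass of all partial alignments
    # that cover frames 0..t and sit on token n at frame t.
    alpha = np.full((num_frames, num_tokens), -np.inf)
    alpha[0, 0] = log_probs[0, 0]  # alignments must start on the first token

    for t in range(1, num_frames):
        stay = alpha[t - 1]                                      # same token as before
        advance = np.concatenate(([-np.inf], alpha[t - 1, :-1]))  # moved on by one token
        alpha[t] = np.logaddexp(stay, advance) + log_probs[t]

    # Alignments must end on the last token at the last frame.
    return alpha[-1, -1]


# Usage: 3 frames, 2 tokens, uniform probabilities.
uniform = np.log(np.full((3, 2), 0.5))
log_likelihood = forward_sum_log_likelihood(uniform)
```

With three frames and two tokens there are exactly two valid alignments (0,0,1 and 0,1,1), each with probability 0.125, so the result equals log(0.25). Negating this value gives a loss that can be minimized to encourage probability mass on monotonic alignments.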
Findings
Improved alignment convergence speed.
Greater robustness to errors on long utterances.
Better perceived speech quality by human evaluators.
Abstract
Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments online. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive end-to-end TTS models rely on durations extracted from external sources. In this paper we leverage the alignment mechanism proposed in RAD-TTS as a generic alignment learning framework, easily applicable to a variety of neural TTS models. The framework combines the forward-sum algorithm, the Viterbi algorithm, and a simple and efficient static prior. In our experiments, the alignment learning framework improves all tested TTS architectures, both autoregressive (Flowtron, Tacotron 2) and non-autoregressive (FastPitch, FastSpeech 2,…
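The Viterbi step mentioned in the abstract can be illustrated with a short sketch that extracts the single most likely monotonic alignment from a soft alignment matrix and converts it to per-token durations, the quantity non-autoregressive models need. This is a simplified example rather than the paper's code: the name `viterbi_monotonic_alignment`, the `(frames, tokens)` layout, and the stay-or-advance-by-one constraint are assumptions made here for illustration.

```python
import numpy as np


def viterbi_monotonic_alignment(log_probs):
    """Most likely monotonic alignment and the durations it implies.

    log_probs[t, n] is assumed to hold log P(token n | frame t).
    Hypothetical helper for illustration only.
    """
    num_frames, num_tokens = log_probs.shape
    if num_frames < num_tokens:
        raise ValueError("need at least one frame per token")

    score = np.full((num_frames, num_tokens), -np.inf)
    advanced = np.zeros((num_frames, num_tokens), dtype=bool)
    score[0, 0] = log_probs[0, 0]  # the path starts on the first token

    for t in range(1, num_frames):
        for n in range(min(t + 1, num_tokens)):
            stay = score[t - 1, n]
            advance = score[t - 1, n - 1] if n > 0 else -np.inf
            advanced[t, n] = advance > stay
            score[t, n] = max(stay, advance) + log_probs[t, n]

    # Backtrack from the last token at the last frame.
    path = np.zeros(num_frames, dtype=int)
    n = num_tokens - 1
    for t in range(num_frames - 1, -1, -1):
        path[t] = n
        if advanced[t, n]:
            n -= 1

    # Duration of each token = number of frames assigned to it.
    durations = np.bincount(path, minlength=num_tokens)
    return path, durations


# Usage: 4 frames, 2 tokens; frame 0 favours token 0, the rest favour token 1.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.1, 0.9]])
path, durations = viterbi_monotonic_alignment(np.log(probs))
```

Here the best path assigns the first frame to token 0 and the remaining three frames to token 1, giving durations of 1 and 3 frames.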
Code & Models
- 🤗 nvidia/tts_en_fastpitch · model · 272 downloads · ♡ 40
- 🤗 infinisoft/tts · model · ♡ 4
- 🤗 theodotus/tts_uk_fastpitch · model · 10 downloads · ♡ 2
- 🤗 Bilgilice/bilgilice35 · model
- 🤗 Mastering-Python-HF/nvidia_tts_en_fastpitch_multispeaker · model · 1 download · ♡ 1
- 🤗 Mastering-Python-HF/nvidia_tts_en_hifitts_hifigan_ft_fastpitch · model · 17 downloads · ♡ 1
- 🤗 praveenchordia/tts · model · ♡ 1
- 🤗 Pendrokar/xva_fastpitch1_1 · model
Videos
NVIDIA’s Amazing AI Clones Your Voice! 🤐 · YouTube
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems
Methods: Attention Is All You Need · Layer Normalization · Softmax · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · FastSpeech 2 · Sigmoid Activation · Highway Layer · Residual Connection
