Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment
Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg

TL;DR
This paper enhances the robustness of LLM-based speech synthesis by enforcing monotonic alignment through CTC loss and attention priors, reducing errors like repetitions and misalignments.
Contribution
The paper introduces a guided attention training method that improves alignment robustness in LLM-based TTS models without adding new parameters.
Findings
Significant reduction in alignment errors and hallucinations.
Improved speech naturalness and consistency.
Enhanced robustness in multi-occurrence token scenarios.
Abstract
Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust as the generated output can contain repeating words, missing words and mis-aligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text. To make the alignment more robust, we propose techniques utilizing CTC loss and attention priors that encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not…
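The idea of an attention prior that encourages monotonic cross-attention can be illustrated with a small sketch. The abstract does not specify the form of the prior, so everything below is an assumption made for illustration: the Gaussian shape centered on the diagonal, the `sigma` width, and the function names `diagonal_prior` and `apply_prior` are hypothetical and not taken from the paper. The sketch adds the log of the prior to raw cross-attention scores before the softmax, so that each speech frame is nudged toward the text position that matches its relative position in time.

```python
import math

def diagonal_prior(num_frames, num_tokens, sigma=0.2):
    """Hypothetical Gaussian-shaped prior centered on the diagonal.

    Returns a num_frames x num_tokens matrix whose rows sum to 1.
    The shape and sigma are illustrative assumptions, not the paper's prior.
    """
    prior = []
    for t in range(num_frames):
        # Relative position of this speech frame in [0, 1].
        pos_t = t / max(num_frames - 1, 1)
        row = []
        for n in range(num_tokens):
            # Relative position of this text token in [0, 1].
            pos_n = n / max(num_tokens - 1, 1)
            row.append(math.exp(-((pos_t - pos_n) ** 2) / (2 * sigma ** 2)))
        total = sum(row)
        prior.append([w / total for w in row])
    return prior

def apply_prior(scores, prior, eps=1e-8):
    """Row-wise softmax over text tokens of (scores + log prior)."""
    attended = []
    for score_row, prior_row in zip(scores, prior):
        logits = [s + math.log(p + eps) for s, p in zip(score_row, prior_row)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        norm = sum(exps)
        attended.append([e / norm for e in exps])
    return attended

# With uninformative (all-zero) raw scores, attention follows the prior.
scores = [[0.0] * 5 for _ in range(10)]
attn = apply_prior(scores, diagonal_prior(10, 5))

# Index of the most-attended text token for each speech frame.
peaks = [row.index(max(row)) for row in attn]
```

In this toy setting the most-attended text position never moves backwards as the speech frame index increases, which is the monotonic behavior the abstract describes. A trained model would combine such a prior with learned scores, and the CTC-based alignment loss mentioned in the abstract is a separate component that this sketch does not cover.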
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Softmax · Attention Is All You Need · Connectionist Temporal Classification Loss
