Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

Paarth Neekhara; Shehzeen Hussain; Subhankar Ghosh; Jason Li; Rafael Valle; Rohan Badlani; Boris Ginsburg

arXiv:2406.17957·cs.SD·June 27, 2024

TL;DR

This paper enhances the robustness of LLM-based speech synthesis by enforcing monotonic alignment through CTC loss and attention priors, reducing errors like repetitions and misalignments.

Contribution

The paper introduces a guided attention training method that improves alignment robustness in LLM-based TTS models without adding new parameters.

Findings

1. Significant reduction in alignment errors and hallucinations.
2. Improved speech naturalness and consistency.
3. Enhanced robustness in multi-occurrence token scenarios.

Abstract

Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust, as the generated output can contain repeated words, missing words, and misaligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained to predict speech tokens for a given text. To make the alignment more robust, we propose techniques utilizing CTC loss and attention priors that encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not…
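To make the idea of a CTC-based alignment objective concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that each row of a cross-attention matrix (one row per speech frame, one column per text token) is treated as an emission distribution, and that the CTC target is simply the text positions in order, so the loss is low only when attention moves left to right over the text. The function name `ctc_alignment_loss`, the `blank_prob` parameter, and the helper `peaked_row` are hypothetical names introduced for illustration.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_alignment_loss(attn, blank_prob=0.05):
    """Negative log-likelihood that attention visits text positions in order.

    attn: T x N list of rows; each row is a softmax over N text tokens.
    blank_prob: assumed probability mass reserved for the CTC blank symbol.
    """
    T, N = len(attn), len(attn[0])
    # Emissions over N + 1 classes: class 0 is blank, class k + 1 is text position k.
    log_probs = []
    for row in attn:
        scaled = [blank_prob] + [(1.0 - blank_prob) * p for p in row]
        log_probs.append([math.log(max(p, 1e-12)) for p in scaled])

    # Extended target with blanks interleaved: blank, 1, blank, 2, ..., N, blank.
    ext = [0]
    for k in range(1, N + 1):
        ext += [k, 0]
    S = len(ext)

    # Standard CTC forward recursion in log space.
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [-math.inf] * S
        for s in range(S):
            candidates = [alpha[s]]
            if s >= 1:
                candidates.append(alpha[s - 1])
            # Skipping the blank between two labels is allowed when the labels
            # differ; here every label is a distinct text position.
            if s >= 2 and ext[s] != 0:
                candidates.append(alpha[s - 2])
            new[s] = _logsumexp(candidates) + log_probs[t][ext[s]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank.
    return -_logsumexp([alpha[S - 1], alpha[S - 2]])


def peaked_row(idx, n, peak=0.9):
    """Attention row with most of its mass on text position idx."""
    rest = (1.0 - peak) / (n - 1)
    return [peak if j == idx else rest for j in range(n)]


# Six speech frames attending over three text tokens.
monotonic = [peaked_row(i, 3) for i in (0, 0, 1, 1, 2, 2)]
backwards = [peaked_row(i, 3) for i in (2, 2, 1, 1, 0, 0)]
loss_mono = ctc_alignment_loss(monotonic)
loss_back = ctc_alignment_loss(backwards)
print(loss_mono < loss_back)
```

In this toy setup the left-to-right attention pattern scores a lower loss than the reversed one, which is the behavior a monotonic-alignment objective is meant to reward. In a real training setup one would more likely use a framework's built-in CTC loss on the attention scores of the selected heads; which heads and layers to supervise is a design choice the sketch does not address.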

Code & Models

Models

nvidia/magpie_tts_multilingual_357m (model; 1.3k downloads; 111 likes)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need · Connectionist Temporal Classification Loss