Distribution augmentation for low-resource expressive text-to-speech
Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova

TL;DR
This paper introduces a data augmentation method for low-resource expressive TTS that substitutes text and audio fragments while preserving syntactic correctness. The added diversity reduces overfitting and improves speech quality and robustness.
Contribution
The augmentation technique generates new (text, audio) training examples without requiring any additional data. This increases the diversity of text conditionings seen during training, which reduces overfitting in low-resource TTS and improves robustness.
Findings
Improves speech quality in perceptual evaluations across multiple datasets, speakers, and TTS architectures.
Enhances robustness of attention-based TTS models.
Reduces overfitting in low-resource settings.
Abstract
This paper presents a novel data augmentation technique for text-to-speech (TTS) that makes it possible to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase the diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactic correctness. We take additional measures to ensure that synthesized speech does not contain artifacts caused by combining inconsistent audio samples. The perceptual evaluations show that our method improves speech quality over a number of datasets, speakers, and TTS architectures. We also demonstrate that it greatly improves the robustness of attention-based TTS models.
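The substitution idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Fragment` type, the `tag` labels (for example "NP" or "VP"), and the `augment` and `render` helpers are all hypothetical names introduced here, and the sketch assumes that each utterance has already been split into text fragments with a syntactic label and time-aligned audio. It also leaves out the additional measures the abstract mentions for avoiding artifacts when inconsistent audio samples are combined.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical unit: a text span, its syntactic label, and aligned audio."""
    text: str
    tag: str                    # assumed syntactic category, e.g. "NP" or "VP"
    audio: Tuple[float, ...]    # audio samples aligned to this text span


Utterance = List[Fragment]


def augment(target: Utterance, donors: List[Utterance],
            rng: random.Random) -> Optional[Utterance]:
    """Replace one fragment of `target` with a donor fragment of the same tag.

    Matching on the tag is a stand-in for "preserving syntactic correctness".
    Returns None when no position has a compatible donor fragment.
    """
    positions = list(range(len(target)))
    rng.shuffle(positions)
    for i in positions:
        candidates = [
            frag
            for utt in donors
            for frag in utt
            if frag.tag == target[i].tag and frag.text != target[i].text
        ]
        if candidates:
            new_utt = list(target)
            new_utt[i] = rng.choice(candidates)
            return new_utt
    return None


def render(utt: Utterance) -> Tuple[str, List[float]]:
    """Join fragments into one (text, audio) training example."""
    text = " ".join(frag.text for frag in utt)
    audio = [sample for frag in utt for sample in frag.audio]
    return text, audio
```

Given two utterances with the tag sequence NP, VP, calling `augment` on one with the other as donor yields a new utterance with the same tag sequence but one fragment (text and audio together) taken from the donor. Swapping text and audio as a pair is what keeps the new example consistent without any extra recordings.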
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
