Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation
Kishor Kayyar Lakshminarayana, Christian Dittmar, Nicola Pia, and Emanuël Habets

TL;DR
This paper introduces a simple noise augmentation method for low-resource text-to-speech synthesis, enabling high-quality single-speaker voice generation with minimal data and minimal computational overhead.
Contribution
It extends Tacotron-2 and its training with a straightforward noise augmentation technique, achieving quality comparable to a baseline trained on a much larger dataset.
Findings
Trained on only 2 hours of data, the model was rated by human listeners as on par with a baseline Tacotron-2 trained on 23.5 hours of LJSpeech data.
The model maintained similar intelligibility levels in semantic unpredictability tests.
Simple stationary noise augmentation is effective for low-resource TTS.
Abstract
Many neural text-to-speech architectures can synthesize nearly natural speech from text inputs. These architectures must be trained with tens of hours of annotated and high-quality speech data. Compiling such large databases for every new voice requires a lot of time and effort. In this paper, we describe a method to extend the popular Tacotron-2 architecture and its training with data augmentation to enable single-speaker synthesis using a limited amount of specific training data. In contrast to elaborate augmentation methods proposed in the literature, we use simple stationary noises for data augmentation. Our extension is easy to implement and adds almost no computational overhead during training and inference. Using only two hours of training data, our approach was rated by human listeners to be on par with the baseline Tacotron-2 trained with 23.5 hours of LJSpeech data. In…
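The abstract does not spell out how the stationary noise is generated or mixed in, so the following is only a minimal sketch of one common form of stationary noise augmentation: adding white Gaussian noise to a waveform at a chosen signal-to-noise ratio. The function names, the choice of white noise, the 20 dB SNR, and the synthetic sine "speech" signal are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def mean_power(signal):
    """Average power (mean of squared samples) of a waveform."""
    return sum(x * x for x in signal) / len(signal)


def add_stationary_noise(speech, snr_db, seed=None):
    """Return `speech` mixed with white Gaussian noise at `snr_db` dB SNR.

    Illustrative assumption: the paper's noise types and levels may differ.
    """
    rng = random.Random(seed)
    # White Gaussian noise has time-invariant statistics, i.e. it is stationary.
    noise = [rng.gauss(0.0, 1.0) for _ in speech]
    # Noise power needed so that speech_power / noise_power = 10^(snr_db / 10).
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    # Rescale the sampled noise so its measured power hits the target.
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy usage: a 220 Hz sine at 16 kHz stands in for a speech waveform.
speech = [math.sin(2.0 * math.pi * 220.0 * t / 16000.0) for t in range(16000)]
noisy = add_stationary_noise(speech, snr_db=20.0, seed=0)

# Check the achieved SNR by isolating the added noise again.
residual = [n - s for n, s in zip(noisy, speech)]
measured_snr_db = 10.0 * math.log10(mean_power(speech) / mean_power(residual))
```

An operation of this kind costs one pass over each training waveform, which is consistent with the abstract's claim that the extension adds almost no computational overhead; how the augmented samples are fed into Tacotron-2 training is not covered by this sketch.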
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
