Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset
Ze Liu

TL;DR
This paper evaluates how transfer learning enhances deep learning-based Text-to-Speech models in low-resource, few-shot scenarios, aiming to improve voice quality with limited data and reduce training time.
Contribution
It provides a comprehensive analysis and experimental comparison of the effectiveness of transfer learning in TTS models trained on small, customized datasets, highlighting the best-performing approaches for low-resource conditions.
Findings
Transfer learning significantly improves TTS performance with limited data.
Certain models outperform others in low-resource settings.
Transfer learning reduces training time while maintaining high voice quality.
Abstract
In deep learning-based Text-to-Speech (TTS) synthesis, voice quality is paramount. Modern TTS models are advanced, but they require large amounts of training data. Given the growing computational complexity of these models and the scarcity of large, high-quality datasets, this research focuses on transfer learning, especially on few-shot, low-resource, and customized datasets. In this research, "low-resource" specifically refers to situations where training data is limited, such as a small number of audio recordings and corresponding transcriptions for a particular language or dialect. This thesis is rooted in the pressing need to find TTS models that require less training time and fewer data samples yet still yield high-quality voice output. The research evaluates the transfer learning capabilities of state-of-the-art TTS models through a thorough technical analysis. It then conducts a hands-on…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
