Semi-Supervised Learning Based on Reference Model for Low-resource TTS
Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

TL;DR
This paper introduces a semi-supervised neural TTS approach that leverages a reference model and pseudo labels to improve speech naturalness and robustness in low-resource scenarios.
Contribution
It proposes a novel semi-supervised training scheme combining pre-training and pseudo label guidance for low-resource neural TTS.
Findings
Significant improvement in voice naturalness and robustness.
Effective reduction of overfitting with limited target data.
Enhanced performance over traditional supervised methods.
Abstract
Most previous neural text-to-speech (TTS) methods are based on supervised learning, which means they depend on a large training dataset and struggle to achieve comparable performance under low-resource conditions. To address this issue, we propose a semi-supervised learning method for neural TTS in settings where labeled target data is limited, which can also resolve the problem of exposure bias in previous auto-regressive models. Specifically, we pre-train a reference model based on Fastspeech2 on a large amount of source data and then fine-tune it on a limited target dataset. Meanwhile, pseudo labels generated by the original reference model are used to further guide the fine-tuned model's training, achieving a regularization effect and reducing overfitting of the fine-tuned model during training on the limited target data. Experimental results show that our proposed semi-supervised learning…
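As a rough illustration of how pseudo labels from a frozen reference model could regularize fine-tuning, the sketch below combines a supervised term on the target data with a term that keeps the fine-tuned model close to the reference model's output. The abstract does not state the actual loss, so the mean absolute error, the `pseudo_weight` parameter, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
from typing import List

# A mel-spectrogram represented as [frame][mel_bin]; plain lists keep the sketch dependency-free.
Frames = List[List[float]]


def mean_abs_error(pred: Frames, target: Frames) -> float:
    """Mean absolute error over all frames and mel bins."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("inputs must contain at least one value")
    return total / count


def semi_supervised_loss(
    student_out: Frames,
    ground_truth: Frames,
    reference_out: Frames,
    pseudo_weight: float = 0.5,  # hypothetical weighting, not taken from the paper
) -> float:
    """Supervised loss plus a pseudo-label term from the frozen reference model."""
    supervised = mean_abs_error(student_out, ground_truth)
    # The pseudo-label term penalizes drifting away from the pre-trained reference model.
    pseudo = mean_abs_error(student_out, reference_out)
    return supervised + pseudo_weight * pseudo


if __name__ == "__main__":
    student = [[0.0, 1.0], [2.0, 3.0]]    # fine-tuned model prediction
    truth = [[0.0, 2.0], [2.0, 3.0]]      # labeled target mel-spectrogram
    reference = [[1.0, 1.0], [2.0, 3.0]]  # pseudo label from the reference model
    print(semi_supervised_loss(student, truth, reference))
```

In this toy setup, a larger `pseudo_weight` pulls the fine-tuned model more strongly toward the reference model, which is one way the regularization effect described in the abstract might be realized.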
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
Methods: Test
