Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data
Zhu Li, Yuqing Zhang, Mengxi Nie, Ming Yan, Mengnan He, Ruixiong Zhang, Caixia Gong

TL;DR
This paper enhances speech synthesis prosody for unseen texts by combining a fine-tuned BERT front-end with a pre-trained FastSpeech2 model, leveraging noisy data and linguistic tasks to improve naturalness.
Contribution
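It fine-tunes a pre-trained BERT front-end on three linguistic tasks (polyphone disambiguation, joint word segmentation and POS tagging, and prosody structure prediction) in a multi-task learning framework, and separately pre-trains a FastSpeech 2 acoustic model on noisy external data, to improve prosody when synthesizing unseen text.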
Findings
Improved prosody for complex sentences.
Enhanced naturalness in synthesized speech.
Effective use of noisy external data.
Abstract
Recent advancements in end-to-end speech synthesis have made it possible to generate highly natural speech. However, training these models typically requires a large amount of high-fidelity speech data, and for unseen texts, the prosody of synthesized speech is relatively unnatural. To address these issues, we propose to combine a fine-tuned BERT-based front-end with a pre-trained FastSpeech2-based acoustic model to improve prosody modeling. The pre-trained BERT is fine-tuned on the polyphone disambiguation task, the joint Chinese word segmentation (CWS) and part-of-speech (POS) tagging task, and the prosody structure prediction (PSP) task in a multi-task learning framework. FastSpeech 2 is pre-trained on large-scale external data that are noisy but easier to obtain. Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech 2 can improve prosody,…
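The multi-task fine-tuning named in the abstract can be pictured as several task heads sharing one encoder, trained on a weighted sum of per-task losses. The sketch below is a minimal illustration in plain Python, not the authors' implementation: the task names, the equal loss weights, the use of token-level cross-entropy for every task, and the convention that a label of -1 marks a token to skip (for example, a character that is not a polyphone) are all assumptions made here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_cross_entropy(logits_per_token, labels):
    """Mean negative log-likelihood over tokens; labels below 0 are ignored."""
    losses = []
    for logits, label in zip(logits_per_token, labels):
        if label < 0:
            continue  # assumed convention: no supervision for this token
        losses.append(-math.log(softmax(logits)[label]))
    return sum(losses) / len(losses) if losses else 0.0


def multi_task_loss(task_logits, task_labels, task_weights):
    """Weighted sum of per-task losses computed from shared-encoder outputs."""
    total = 0.0
    for task, logits in task_logits.items():
        total += task_weights[task] * token_cross_entropy(logits, task_labels[task])
    return total


# Hypothetical head outputs for a two-token input (class counts are made up).
task_logits = {
    "polyphone": [[0.0, 0.0], [0.0, 0.0]],
    "cws_pos": [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    "psp": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
}
task_labels = {
    "polyphone": [-1, 0],  # first token is skipped
    "cws_pos": [1, 3],
    "psp": [0, 2],
}
task_weights = {"polyphone": 1.0, "cws_pos": 1.0, "psp": 1.0}  # assumed equal

loss = multi_task_loss(task_logits, task_labels, task_weights)
```

In a real system the per-task logits would come from separate output layers on top of the shared BERT encoder, and the combined loss would be backpropagated through all of them; how the paper weights or schedules the three tasks is not stated in the excerpt above.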
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Warmup With Linear Decay · WordPiece · Weight Decay · Dropout · Residual Connection
