Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models
Wei Fang, Yu-An Chung, James Glass

TL;DR
This paper explores integrating deep pre-trained language models like BERT into end-to-end TTS systems to reduce dependence on high-quality data, showing improvements in training efficiency and decoding accuracy.
Contribution
It introduces a novel method of incorporating BERT representations into Tacotron-2, enhancing training convergence and decoding accuracy without requiring high-quality speech data.
Findings
Faster training convergence observed with BERT integration
Improved accuracy in determining when to stop decoding
No significant increase in speech naturalness or clarity
Abstract
Modern text-to-speech (TTS) systems are able to generate audio that sounds almost as natural as human speech. However, the bar of developing high-quality TTS systems remains high since a sizable set of studio-quality <text, audio> pairs is usually required. Compared to commercial data used to develop state-of-the-art systems, publicly available data are usually worse in terms of both quality and size. Audio generated by TTS systems trained on publicly available data tends to not only sound less natural, but also exhibits more background noise. In this work, we aim to lower TTS systems' reliance on high-quality data by providing them the textual knowledge extracted by deep pre-trained language models during training. In particular, we investigate the use of BERT to assist the training of Tacotron-2, a state of the art TTS consisting of an encoder and an attention-based decoder. BERT…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
MethodsLinear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Refunds@Expedia|||How do I get a full refund from Expedia? · Dense Connections · Adam · WordPiece · Softmax
