Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

Wei Fang; Yu-An Chung; James Glass

arXiv:1906.07307·cs.CL·June 19, 2019·26 cites

Open Access

TL;DR

This paper explores integrating deep pre-trained language models such as BERT into end-to-end TTS systems to reduce their dependence on high-quality data. The BERT-assisted model converges faster during training and predicts more accurately when to stop decoding, but its speech is not significantly more natural.

Contribution

It introduces a method for incorporating BERT representations into Tacotron-2, with the aim of lowering the model's reliance on high-quality speech data. The reported gains are faster training convergence and more accurate prediction of when to stop decoding.

Findings

01 Faster training convergence observed with BERT integration

02 Improved accuracy in determining when to stop decoding

03 No significant increase in speech naturalness or clarity

Abstract

Modern text-to-speech (TTS) systems are able to generate audio that sounds almost as natural as human speech. However, the bar of developing high-quality TTS systems remains high since a sizable set of studio-quality <text, audio> pairs is usually required. Compared to commercial data used to develop state-of-the-art systems, publicly available data are usually worse in terms of both quality and size. Audio generated by TTS systems trained on publicly available data tends to not only sound less natural, but also exhibits more background noise. In this work, we aim to lower TTS systems' reliance on high-quality data by providing them the textual knowledge extracted by deep pre-trained language models during training. In particular, we investigate the use of BERT to assist the training of Tacotron-2, a state-of-the-art TTS consisting of an encoder and an attention-based decoder. BERT…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax