Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis
Guanghui Xu, Wei Song, Zhengchen Zhang, Chao Zhang, Xiaodong He, Bowen Zhou

TL;DR
This paper enhances end-to-end speech synthesis by incorporating cross-utterance BERT embeddings to improve prosody modeling, resulting in more natural and expressive speech in paragraph-level synthesis.
Contribution
It introduces a novel method using cross-utterance BERT embeddings to incorporate discourse-level context into TTS systems without explicit prosody features.
Findings
Improved naturalness and expressiveness in synthesized speech.
Participants in listening tests preferred the voices generated with the CU encoder.
Prosody can be indirectly controlled by neighboring sentence embeddings.
Abstract
Although prosody is related to linguistic information up to the discourse structure, most text-to-speech (TTS) systems only take into account the information within each sentence, which makes it challenging to convert a paragraph of text into natural and expressive speech. In this paper, we propose to use the text embeddings of the neighboring sentences to improve the prosody generation for each utterance of a paragraph in an end-to-end fashion without using any explicit prosody features. More specifically, cross-utterance (CU) context vectors, which are produced by an additional CU encoder based on the sentence embeddings extracted by a pre-trained BERT model, are used to augment the input of the Tacotron2 decoder. Two types of BERT embeddings are investigated, which leads to the use of different CU encoder structures. Experimental results on a Mandarin audiobook dataset and the…
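The data flow described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the function names `cu_context_vector` and `augment_encoder_outputs` are hypothetical, random vectors stand in for the BERT sentence embeddings, the CU encoder is reduced to a single tanh projection of the concatenated neighbor embeddings, the dimensions are toy values, and "augmenting the decoder input" is read as appending the context vector to every text-encoder output step. The paper's actual CU encoder structures are not specified in the abstract above.

```python
import math
import random


def cu_context_vector(neighbor_embeddings, weight, bias):
    """Hypothetical CU encoder: one tanh projection of the
    concatenated neighbor-sentence embeddings (an assumption)."""
    flat = [x for emb in neighbor_embeddings for x in emb]
    return [
        math.tanh(sum(w * x for w, x in zip(row, flat)) + b)
        for row, b in zip(weight, bias)
    ]


def augment_encoder_outputs(encoder_outputs, context):
    """Append the same context vector to every encoder time step."""
    return [list(step) + list(context) for step in encoder_outputs]


random.seed(0)
emb_dim, context_dim, n_neighbors = 8, 4, 2   # toy sizes, not from the paper
n_steps, enc_dim = 5, 6                       # toy sizes, not from the paper

# Random stand-ins for the sentence embeddings of neighboring utterances.
neighbors = [[random.gauss(0, 1) for _ in range(emb_dim)]
             for _ in range(n_neighbors)]

# Randomly initialised projection parameters (would be learned in practice).
weight = [[random.gauss(0, 0.1) for _ in range(emb_dim * n_neighbors)]
          for _ in range(context_dim)]
bias = [0.0] * context_dim

# Random stand-ins for the text-encoder outputs of the current utterance.
encoder_outputs = [[random.gauss(0, 1) for _ in range(enc_dim)]
                   for _ in range(n_steps)]

context = cu_context_vector(neighbors, weight, bias)
augmented = augment_encoder_outputs(encoder_outputs, context)
print(len(augmented), len(augmented[0]))  # 5 10
```

In this sketch the decoder would attend over `augmented` instead of `encoder_outputs`, so changing the neighboring sentences changes the context vector and, indirectly, the predicted prosody.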
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Linear Layer · Dropout · Attention Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · WordPiece · Layer Normalization
