Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis

Guanghui Xu; Wei Song; Zhengchen Zhang; Chao Zhang; Xiaodong He; Bowen Zhou

arXiv:2011.05161·eess.AS·November 11, 2020

TL;DR

This paper enhances end-to-end speech synthesis by incorporating cross-utterance BERT embeddings to improve prosody modeling, resulting in more natural and expressive speech in paragraph-level synthesis.

Contribution

It introduces a novel method using cross-utterance BERT embeddings to incorporate discourse-level context into TTS systems without explicit prosody features.

Findings

1. The CU encoder improves the naturalness and expressiveness of synthesized speech.

2. Participants in listening tests preferred the voices generated with the CU encoder.

3. Prosody can be controlled indirectly by changing the neighboring sentences whose embeddings are fed to the model.

Abstract

Although prosody is related to linguistic information up to the level of discourse structure, most text-to-speech (TTS) systems only take into account the information within each sentence, which makes it challenging to convert a paragraph of text into natural and expressive speech. In this paper, we propose to use the text embeddings of the neighboring sentences to improve the prosody generation for each utterance of a paragraph in an end-to-end fashion without using any explicit prosody features. More specifically, cross-utterance (CU) context vectors, which are produced by an additional CU encoder based on the sentence embeddings extracted by a pre-trained BERT model, are used to augment the input of the Tacotron2 decoder. Two types of BERT embeddings are investigated, which leads to the use of different CU encoder structures. Experimental results on a Mandarin audiobook dataset and the…
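The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `cu_encoder` and `augment_decoder_input`, the single linear layer with a tanh activation, the toy dimensions, and the choice to concatenate the context vector to every encoder frame are all assumptions made for illustration. In a real system the neighbor embeddings would come from a pre-trained BERT model and the weights would be learned; here both are random placeholders.

```python
import math
import random


def cu_encoder(neighbor_embeddings, weights, bias):
    """Hypothetical CU encoder: flatten the neighboring-sentence embeddings
    and project them to one context vector (single linear layer + tanh)."""
    flat = [x for emb in neighbor_embeddings for x in emb]
    return [
        math.tanh(sum(w * x for w, x in zip(row, flat)) + b)
        for row, b in zip(weights, bias)
    ]


def augment_decoder_input(encoder_outputs, context):
    """Append the same CU context vector to every text-encoder frame
    (assumed here to be how the decoder input is augmented)."""
    return [list(frame) + list(context) for frame in encoder_outputs]


# Toy sizes, chosen only for the example.
EMB_DIM, CTX_DIM, ENC_DIM, NUM_FRAMES = 4, 3, 6, 5
rng = random.Random(0)

# Placeholder embeddings for the previous and the next sentence.
neighbors = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(2)]

# Placeholder projection parameters (these would be trained in practice).
weights = [
    [rng.uniform(-0.5, 0.5) for _ in range(2 * EMB_DIM)] for _ in range(CTX_DIM)
]
bias = [0.0] * CTX_DIM

# Placeholder text-encoder outputs for the current utterance.
encoder_outputs = [
    [rng.uniform(-1, 1) for _ in range(ENC_DIM)] for _ in range(NUM_FRAMES)
]

context = cu_encoder(neighbors, weights, bias)
augmented = augment_decoder_input(encoder_outputs, context)
print(len(augmented), len(augmented[0]))  # prints: 5 9
```

Each of the 5 frames grows from 6 to 9 values because the 3-dimensional context vector is appended to it. Swapping in different neighboring sentences changes only `context`, which is how the surrounding text could influence prosody in this sketch without any explicit prosody features.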

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Linear Layer · Dropout · Attention Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · WordPiece · Layer Normalization