Speech BERT Embedding For Improving Prosody in Neural TTS
Liping Chen, Yan Deng, Xi Wang, Frank K. Soong, Lei He

TL;DR
This paper introduces a speech BERT model that extracts segment-level prosody embeddings to enhance the naturalness and expressiveness of neural TTS, demonstrating improved prosody and listener preference.
Contribution
It proposes a novel BERT-based prosody embedding method that captures fine-grained prosody information to improve neural TTS output.
Findings
Reduced objective distortion in TTS output
Subjective preference for BERT-enhanced TTS
Effective prosody modeling across multiple speakers
Abstract
This paper presents a speech BERT model to extract embedded prosody information in speech segments for improving the prosody of synthesized speech in neural text-to-speech (TTS). As a pre-trained model, it can learn prosody attributes from a large amount of speech data, which can utilize more data than the original training data used by the target TTS. The embedding is extracted from the previous segment of a fixed length in the proposed BERT. The extracted embedding is then used together with the mel-spectrogram to predict the following segment in the TTS decoder. Experimental results obtained by the Transformer TTS show that the proposed BERT can extract fine-grained, segment-level prosody, which is complementary to utterance-level prosody to improve the final prosody of the TTS speech. The objective distortions measured on a single speaker TTS are reduced between the generated speech…
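The conditioning scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the segment length, the embedding size, zero-padding at the start of the utterance, and concatenation as the way of combining the embedding with the mel-spectrogram are all assumptions, and `toy_prosody_encoder` is a hypothetical stand-in (mean pooling plus a linear projection) for the pre-trained speech BERT.

```python
import numpy as np


def previous_segment(mels, t, seg_len):
    """Return the seg_len frames that precede frame t, zero-padded at the start."""
    n_mels = mels.shape[1]
    start = max(0, t - seg_len)
    seg = mels[start:t]
    pad = np.zeros((seg_len - seg.shape[0], n_mels), dtype=mels.dtype)
    return np.concatenate([pad, seg], axis=0)


def toy_prosody_encoder(segment, proj):
    """Hypothetical stand-in for the speech BERT: mean-pool frames, then project."""
    return segment.mean(axis=0) @ proj


def build_decoder_inputs(mels, seg_len, proj):
    """Pair teacher-forced mel frames with the embedding of the previous segment.

    Assumption: the embedding is concatenated to every decoder input frame
    of the segment being predicted.
    """
    n_frames, n_mels = mels.shape
    emb_dim = proj.shape[1]
    # Teacher forcing: the decoder input at frame t is the ground-truth frame t-1.
    shifted = np.concatenate([np.zeros((1, n_mels), dtype=mels.dtype), mels[:-1]], axis=0)
    out = np.zeros((n_frames, n_mels + emb_dim), dtype=mels.dtype)
    for start in range(0, n_frames, seg_len):
        end = min(start + seg_len, n_frames)
        emb = toy_prosody_encoder(previous_segment(mels, start, seg_len), proj)
        out[start:end, :n_mels] = shifted[start:end]
        out[start:end, n_mels:] = emb  # same embedding for the whole segment
    return out


rng = np.random.default_rng(0)
mels = rng.standard_normal((10, 4))   # 10 frames, 4 mel bins (toy sizes)
proj = rng.standard_normal((4, 3))    # toy projection to a 3-dim embedding
inputs = build_decoder_inputs(mels, seg_len=4, proj=proj)
print(inputs.shape)
```

In this sketch the first segment receives an all-zero embedding because no previous segment exists; a real system would need its own convention for that case.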
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Phonetics and Phonology Research
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Linear Warmup With Linear Decay · Residual Connection · WordPiece
