Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

Sri Karlapati; Ammar Abbas; Zack Hodari; Alexis Moinet; Arnaud Joly; Penny Karanasou; Thomas Drugman

arXiv:2011.02252·eess.AS·November 5, 2020

TL;DR

This paper presents Kathaka, a neural TTS model that learns a sentence-level prosody distribution and samples from it using contextual information in the text, yielding a statistically significant 13.2% relative improvement in naturalness over a strong baseline.

Contribution

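Introduces a two-stage training process for neural TTS: Stage I learns a sentence-level prosodic distribution from mel-spectrograms, and Stage II samples from that distribution using textual context, obtained with BERT on the text and graph-attention networks on its parse trees.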

Findings

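1. A statistically significant 13.2% relative improvement in naturalness over a strong baseline, when compared to recordings.

2. Prosody is modelled as a distribution learnt at the sentence level from mel-spectrograms.

3. In an ablation study, every variation of the sampling technique shows a statistically significant improvement over the baseline.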

Abstract

In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of $13.2\%$ in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
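
The abstract does not spell out how the "relative improvement over a baseline when compared to recordings" is computed. One common reading, assumed here and not confirmed by this summary, is the fraction of the baseline-to-recordings gap that the new system closes. The sketch below illustrates that arithmetic only; the function name `relative_gap_reduction` and all three scores are hypothetical values picked to make the calculation concrete, not figures from the paper.

```python
def relative_gap_reduction(baseline: float, system: float, recordings: float) -> float:
    """Fraction of the baseline-to-recordings gap closed by the system.

    Assumed definition (not taken from the paper):
        (system - baseline) / (recordings - baseline)
    """
    gap = recordings - baseline
    if gap <= 0:
        raise ValueError("recordings must score higher than the baseline")
    return (system - baseline) / gap


# Hypothetical listening-test scores, invented purely for illustration.
baseline_score = 60.0
recordings_score = 80.0
system_score = 62.64

reduction = relative_gap_reduction(baseline_score, system_score, recordings_score)
print(f"gap closed: {reduction:.1%}")
```

With these made-up scores the system closes 2.64 points of a 20-point gap, which is about 13.2%. Under this reading, a modest absolute change in score can still be a meaningful relative improvement when the baseline is already close to recordings.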

Taxonomy

Methods: Linear Layer · Softmax · Dense Connections · WordPiece · Linear Warmup With Linear Decay · Attention Dropout · Residual Connection · Adam · Dropout · Weight Decay