Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech
Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman

TL;DR
This paper presents Kathaka, a neural TTS model that learns a sentence-level prosody distribution and samples from it using contextual information from text, achieving a statistically significant 13.2% relative improvement in naturalness over a strong baseline.
Contribution
Introduces a novel two-stage training process for neural TTS that incorporates contextual prosody sampling using BERT and graph-attention networks.
Findings
13.2% relative improvement in naturalness over baseline
Effective prosody modeling at sentence level
Robust sampling method with consistent improvements
Abstract
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
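The two-stage idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the Kathaka architecture: time-averaging stands in for the learned Stage I prosody encoder, and a similarity-weighted average over training sentences stands in for the learned Stage II sampler (which the paper builds from BERT and graph-attention networks). The function names, dimensions, and random toy data are all hypothetical and chosen only for illustration.

```python
import math
import random


def encode_sentence_prosody(mel_frames):
    """Stage I stand-in (assumption): collapse a frames-by-bins mel-spectrogram
    into one sentence-level vector by averaging over time."""
    n_frames = len(mel_frames)
    n_bins = len(mel_frames[0])
    return [sum(frame[b] for frame in mel_frames) / n_frames for b in range(n_bins)]


def predict_prosody_from_text(query_feat, train_feats, train_latents, temperature=1.0):
    """Stage II stand-in (assumption): predict a prosody vector for new text as a
    softmax-weighted average of Stage I vectors, weighted by text-feature similarity."""
    scores = [
        sum(q * t for q, t in zip(query_feat, feat)) / temperature
        for feat in train_feats
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(train_latents[0])
    return [
        sum(w * latent[d] for w, latent in zip(weights, train_latents)) / total
        for d in range(dim)
    ]


# Toy data: random "mel-spectrograms" and random "text features" (hypothetical).
random.seed(0)
N_SENTENCES, N_BINS, TEXT_DIM = 8, 4, 6
train_mels = [
    [[random.random() for _ in range(N_BINS)] for _ in range(random.randint(5, 10))]
    for _ in range(N_SENTENCES)
]
train_text = [[random.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(N_SENTENCES)]

# Stage I: one prosody vector per training sentence, derived from audio only.
train_latents = [encode_sentence_prosody(mel) for mel in train_mels]

# Stage II: at synthesis time only text is available, so predict from text features.
new_text = [random.gauss(0, 1) for _ in range(TEXT_DIM)]
predicted = predict_prosody_from_text(new_text, train_text, train_latents)
```

The point of the split is that Stage I may look at audio, while Stage II must produce a point in the same prosody space from text alone, since no mel-spectrogram exists at synthesis time.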
Taxonomy
Methods
Linear Layer · Softmax · Dense Connections · WordPiece · Linear Warmup With Linear Decay · Attention Dropout · Residual Connection · Adam · Dropout · Weight Decay
