Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis

Fengyu Yang; Shan Yang; Qinghua Wu; Yujun Wang; Lei Xie

arXiv:2008.00613 · eess.AS · August 4, 2020

TL;DR

This paper introduces a context extractor built on self-attention networks to better utilize sentential context in expressive speech synthesis, significantly improving prosody and naturalness in generated speech.

Contribution

It proposes a novel context extraction method that aggregates multi-layer SAN outputs to enhance prosodic modeling in end-to-end TTS systems, especially for expressive corpora.

Findings

01. Enhanced prosody and expressiveness in synthesized speech.

02. Weighted aggregation outperforms direct aggregation in modeling expressivity.

03. Approach improves naturalness on expressive speech corpora.

Abstract

Attention-based seq2seq text-to-speech systems, especially those that use self-attention networks (SAN), have achieved state-of-the-art performance. But an expressive corpus with rich prosody is still challenging to model because 1) prosodic aspects, which span different sentential granularities and mainly determine acoustic expressiveness, are difficult to quantize and label, and 2) the current seq2seq framework extracts prosodic information solely from a text encoder, which easily collapses to an averaged expression for expressive content. In this paper, we propose a context extractor, built upon a SAN-based text encoder, to sufficiently exploit the sentential context over an expressive corpus for seq2seq-based TTS. Our context extractor first collects prosody-related sentential context information from different SAN layers and then aggregates them to learn a comprehensive…
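
The aggregation step described in the abstract (collecting outputs from several SAN layers and combining them) can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: "direct aggregation" is taken here to be an unweighted mean over layers, "weighted aggregation" a softmax-weighted sum with one scalar per layer, and the function names and the `layer_logits` parameter are hypothetical. The exact formulation used in the paper is not given in this excerpt.

```python
import numpy as np


def direct_aggregation(layer_outputs):
    """Combine per-layer outputs with equal weight (assumed: a plain mean)."""
    # Each element is expected to have shape (seq_len, dim).
    stacked = np.stack(layer_outputs, axis=0)  # (num_layers, seq_len, dim)
    return stacked.mean(axis=0)


def weighted_aggregation(layer_outputs, layer_logits):
    """Combine per-layer outputs with softmax-normalised layer weights.

    `layer_logits` is a hypothetical vector of one scalar per layer; in a
    trained model such scalars would be learned jointly with the network.
    """
    stacked = np.stack(layer_outputs, axis=0)  # (num_layers, seq_len, dim)
    logits = np.asarray(layer_logits, dtype=stacked.dtype)
    # Numerically stable softmax over the layer axis.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Weighted sum over layers yields an array of shape (seq_len, dim).
    return np.tensordot(weights, stacked, axes=(0, 0))
```

With all logits equal, the weighted variant reduces to the direct one; unequal logits let some layers contribute more than others to the aggregated context.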

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques