Emotional End-to-End Neural Speech Synthesizer

Younggun Lee; Azam Rabiee; Soo-Young Lee

arXiv:1711.05447·cs.SD·November 7, 2018·58 cites

TL;DR

This paper presents a Tacotron-based end-to-end neural speech synthesizer that generates speech for given emotion labels. It addresses exposure bias and irregular attention alignment by using context vectors and residual connections in the RNNs.

Contribution

The paper introduces improvements to Tacotron by using context vectors and residual connections to better handle emotional speech synthesis.

Findings

01. Successful training and speech generation for emotion labels

02. Improved attention alignment and stability in synthesis

03. Effective handling of exposure bias in neural speech models

Abstract

In this paper, we introduce an emotional speech synthesizer based on the recent end-to-end neural model Tacotron. Despite its benefits, we found that the original Tacotron suffers from the exposure bias problem and from irregular attention alignment. We address these problems by using the context vector and residual connections in the recurrent neural networks (RNNs). Our experiments showed that the model could be trained successfully and could generate speech for given emotion labels.
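The abstract names two ingredients: feeding the context vector to the RNNs and adding residual connections around them. The sketch below illustrates that idea for a single time step in plain Python. It is not the authors' implementation: the simple tanh cell (in place of a GRU), the layer sizes, the one-hot emotion input, and all function names (`init_layer`, `linear`, `residual_rnn_step`) are assumptions made for illustration.

```python
import math
import random


def init_layer(n_out, n_in, rng, scale=0.1):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Compute weights @ x + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def residual_rnn_step(params, hidden, layer_in, context, emotion):
    """One step of a tanh RNN layer with extra inputs and a residual connection.

    Assumption: layer_in and hidden have the same size so they can be added.
    """
    weights, bias = params
    # Feed the attention context vector and the emotion label to the cell
    # together with the layer input and the previous hidden state.
    cell_in = layer_in + context + emotion + hidden  # list concatenation
    new_hidden = [math.tanh(v) for v in linear(weights, bias, cell_in)]
    # Residual connection: the layer output is its input plus the cell output.
    output = [x + h for x, h in zip(layer_in, new_hidden)]
    return output, new_hidden


rng = random.Random(0)
size, context_size, n_emotions = 4, 3, 2
params = init_layer(size, size + context_size + n_emotions + size, rng)

hidden = [0.0] * size
layer_in = [0.5, -0.2, 0.1, 0.0]
context = [0.3, 0.3, 0.4]   # hypothetical attention context vector
emotion = [1.0, 0.0]        # hypothetical one-hot emotion label
output, hidden = residual_rnn_step(params, hidden, layer_in, context, emotion)
print(output)
```

With all-zero weights the cell contributes nothing and the layer passes its input through unchanged. That identity path is the usual motivation for residual connections in deep recurrent stacks.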

Code & Models

Repositories

AzamRabiee/Emotional-TTS (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Griffin-Lim Algorithm · Sigmoid Activation · Highway Layer · Convolution · Batch Normalization · Max Pooling · Residual GRU · Bidirectional GRU · Highway Network · CBHG