Emotional End-to-End Neural Speech Synthesizer
Younggun Lee, Azam Rabiee, Soo-Young Lee

TL;DR
This paper presents an end-to-end neural speech synthesizer, based on Tacotron, that generates emotional speech. It addresses exposure bias and irregular attention alignment by using the context vector and residual connections in the recurrent layers.
Contribution
The paper introduces improvements to Tacotron by using context vectors and residual connections to better handle emotional speech synthesis.
Findings
Successful training and speech generation for emotion labels
Improved attention alignment and stability in synthesis
Effective handling of exposure bias in neural speech models
Abstract
In this paper, we introduce an emotional speech synthesizer based on a recent end-to-end neural model, Tacotron. Despite its benefits, we found that the original Tacotron suffers from the exposure bias problem and from irregular attention alignment. We address these problems by using the context vector and residual connections in the recurrent neural networks (RNNs). Our experiments showed that the model could be trained successfully and could generate speech for given emotion labels.
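To make the two ideas named in the abstract concrete, here is a minimal sketch of one decoder time step through a stack of GRU layers, where each layer receives the attention context vector alongside its input and is wrapped in a residual connection. The abstract does not specify the wiring, so everything here is an assumption made for illustration: the names `GRUCell` and `residual_decoder_step`, the layer sizes, the random weights, and the choice to concatenate the context vector at every layer are hypothetical and not taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Plain GRU cell, biases omitted for brevity (illustrative only)."""

    def __init__(self, input_size, hidden_size, rng):
        n = input_size + hidden_size
        self.Wz = rand_matrix(hidden_size, n, rng)  # update gate weights
        self.Wr = rand_matrix(hidden_size, n, rng)  # reset gate weights
        self.Wh = rand_matrix(hidden_size, n, rng)  # candidate state weights

    def step(self, x, h):
        xh = x + h  # list concatenation [x ; h]
        z = [sigmoid(a) for a in matvec(self.Wz, xh)]
        r = [sigmoid(a) for a in matvec(self.Wr, xh)]
        xrh = x + [ri * hi for ri, hi in zip(r, h)]
        h_tilde = [math.tanh(a) for a in matvec(self.Wh, xrh)]
        # Interpolate between the previous state and the candidate state.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, h_tilde)]


def residual_decoder_step(cells, layer_input, context, states):
    """Run one time step through stacked GRU layers.

    Assumption: every layer sees [layer_input ; context], and each layer's
    output is added to its input (residual connection).
    """
    new_states = []
    for cell, h in zip(cells, states):
        h_new = cell.step(layer_input + context, h)
        new_states.append(h_new)
        layer_input = [a + b for a, b in zip(layer_input, h_new)]
    return layer_input, new_states


# Usage: two residual GRU layers, hidden size 4, context vector size 3.
rng = random.Random(0)
hidden, ctx_dim = 4, 3
cells = [GRUCell(hidden + ctx_dim, hidden, rng) for _ in range(2)]
states = [[0.0] * hidden for _ in cells]
x = [0.5] * hidden          # stand-in for the decoder input at this step
context = [0.1] * ctx_dim   # stand-in for the attention context vector
out, states = residual_decoder_step(cells, x, context, states)
```

Because of the residual path, `out` equals the original input plus the sum of the per-layer GRU outputs, so the input signal reaches the top of the stack even when the gates are nearly closed.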
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Griffin-Lim Algorithm · Sigmoid Activation · Highway Layer · Convolution · Batch Normalization · Max Pooling · Residual GRU · Bidirectional GRU · Highway Network · CBHG
