Investigating gated recurrent neural networks for speech synthesis

Zhizheng Wu; Simon King

arXiv:1601.02539 · cs.CL · January 12, 2016 · 19 cites

Open Access

TL;DR

This paper investigates the effectiveness of gated recurrent neural networks, specifically LSTMs, for speech synthesis, analyzing their components and proposing a simplified architecture that reduces complexity while maintaining quality.

Contribution

It provides insights into why LSTMs work well for speech synthesis and introduces a simplified, less complex architecture with comparable performance.

Findings

01 LSTMs outperform deep feed-forward networks in statistical parametric speech synthesis (SPSS)

02 Component analysis identifies key gates influencing performance

03 Simplified architecture reduces parameters without quality loss
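
As a rough illustration of why dropping gates reduces the parameter count, the sketch below counts the weights in a gated recurrent layer. It assumes each gate and the tanh candidate block has its own input weights, recurrent weights, and bias, with no peephole connections. The helper `gated_layer_params` and the layer sizes are hypothetical and are not taken from the paper; the one-gate case is a generic variant, not necessarily the architecture the authors propose.

```python
def gated_layer_params(input_dim, hidden_dim, num_gates):
    """Count trainable parameters in a gated recurrent layer.

    Assumes `num_gates` sigmoid gates plus one tanh candidate block, each
    with input weights, recurrent weights, and a bias (no peepholes).
    """
    per_block = input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim
    return (num_gates + 1) * per_block


# Hypothetical layer sizes chosen only for illustration.
full_lstm = gated_layer_params(256, 256, num_gates=3)  # input, forget, output
one_gate = gated_layer_params(256, 256, num_gates=1)   # generic single-gate variant

print(full_lstm, one_gate, one_gate / full_lstm)
```

Under these assumptions the count scales linearly with the number of blocks, so each gate removed saves one block of input, recurrent, and bias parameters.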

Abstract

Recently, recurrent neural networks (RNNs) as powerful sequence models have re-emerged as a potential acoustic model for statistical parametric speech synthesis (SPSS). The long short-term memory (LSTM) architecture is particularly attractive because it addresses the vanishing gradient problem in standard RNNs, making them easier to train. Although recent studies have demonstrated that LSTMs can achieve significantly better performance on SPSS than deep feed-forward neural networks, little is known about why. Here we attempt to answer two questions: a) why do LSTMs work well as a sequence model for SPSS; b) which component (e.g., input gate, output gate, forget gate) is most important. We present a visual analysis alongside a series of experiments, resulting in a proposal for a simplified architecture. The simplified architecture has significantly fewer parameters than an LSTM, thus…
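
The gates named in the abstract can be made concrete with a minimal NumPy sketch of one time step of a standard LSTM cell. This is the common textbook formulation without peephole connections, not the simplified architecture the paper proposes. The function name `lstm_step`, the stacked weight layout, and the dimensions in the usage lines are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM cell (no peephole connections).

    Shapes (assumed layout): x (D,), h_prev and c_prev (H,),
    W (4H, D), U (4H, H), b (4H,). The four stacked blocks are, in order:
    input gate, forget gate, output gate, candidate.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b

    i = sigmoid(z[0:H])          # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate cell content

    c = f * c_prev + i * g       # additive cell-state update
    h = o * np.tanh(c)           # hidden state passed to the next layer/step
    return h, c


# Usage with small random weights (sizes are arbitrary).
rng = np.random.default_rng(0)
D, H = 5, 3
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)
```

The additive update `c = f * c_prev + i * g` is the part usually credited with easing the vanishing gradient problem, since the forget gate controls how directly the previous cell state carries over.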

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory