Investigating gated recurrent neural networks for speech synthesis
Zhizheng Wu, Simon King

TL;DR
This paper investigates the effectiveness of gated recurrent neural networks, specifically LSTMs, for speech synthesis, analyzing their components and proposing a simplified architecture that reduces complexity while maintaining quality.
Contribution
It provides insights into why LSTMs work well for speech synthesis and introduces a simplified architecture that has fewer parameters and comparable performance.
Findings
LSTMs outperform deep feed-forward networks in statistical parametric speech synthesis (SPSS)
Component analysis identifies key gates influencing performance
Simplified architecture reduces parameters without quality loss
Abstract
Recently, recurrent neural networks (RNNs) as powerful sequence models have re-emerged as a potential acoustic model for statistical parametric speech synthesis (SPSS). The long short-term memory (LSTM) architecture is particularly attractive because it addresses the vanishing gradient problem in standard RNNs, making them easier to train. Although recent studies have demonstrated that LSTMs can achieve significantly better performance on SPSS than deep feed-forward neural networks, little is known about why. Here we attempt to answer two questions: a) why do LSTMs work well as a sequence model for SPSS; b) which component (e.g., input gate, output gate, forget gate) is most important. We present a visual analysis alongside a series of experiments, resulting in a proposal for a simplified architecture. The simplified architecture has significantly fewer parameters than an LSTM, thus…
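To make the gate terminology in the abstract concrete, the following is a minimal sketch of one time step of a standard LSTM cell in NumPy, assuming the common formulation without peephole connections. It is not the authors' implementation; the function names (`init_params`, `lstm_step`, `count_params`), the layer sizes, and the random initialisation are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_in, n_hid, seed=0):
    """Hypothetical helper: one (weights, bias) pair per block.

    "i" = input gate, "f" = forget gate, "o" = output gate,
    "g" = candidate cell update. Each weight matrix acts on the
    concatenation of the current input and the previous hidden state.
    """
    rng = np.random.default_rng(seed)
    return {
        name: (rng.normal(0.0, 0.1, (n_hid, n_in + n_hid)), np.zeros(n_hid))
        for name in ("i", "f", "o", "g")
    }

def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard LSTM cell (no peephole connections)."""
    z = np.concatenate([x, h_prev])
    W_i, b_i = params["i"]
    W_f, b_f = params["f"]
    W_o, b_o = params["o"]
    W_g, b_g = params["g"]
    i = sigmoid(W_i @ z + b_i)   # input gate: how much new content to write
    f = sigmoid(W_f @ z + b_f)   # forget gate: how much old cell state to keep
    o = sigmoid(W_o @ z + b_o)   # output gate: how much cell state to expose
    g = np.tanh(W_g @ z + b_g)   # candidate cell update
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c

def count_params(params):
    """Total number of trainable scalars across all blocks."""
    return sum(W.size + b.size for W, b in params.values())

# Illustrative sizes only; not taken from the paper.
params = init_params(n_in=3, n_hid=4)
h, c = lstm_step(np.ones(3), np.zeros(4), np.zeros(4), params)
n_params = count_params(params)
```

In this formulation each of the four blocks holds `n_hid * (n_in + n_hid) + n_hid` parameters, so dropping any one gate removes a quarter of the cell's parameters. Which components the proposed simplified architecture actually keeps is not stated in the excerpt above, so the sketch does not attempt to reproduce it.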
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
