Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices
Heiga Zen, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, Przemysław Szczepaniak

TL;DR
This paper presents LSTM-RNN-based statistical parametric speech synthesizers optimized for mobile devices. Weight quantization, multi-frame inference, and robust inference with an ε-contaminated Gaussian loss bring runtime speed in line with HMM-based synthesis while keeping naturalness high and the model compact.
Contribution
The paper introduces three optimizations (weight quantization, multi-frame inference, and robust inference using an ε-contaminated Gaussian loss function) that enable LSTM-RNN speech synthesis to run efficiently on mobile devices without sacrificing naturalness.
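As a rough illustration of weight quantization, the sketch below maps floating-point weights to 8-bit integer codes with a linear (min/max) scheme and then restores them. The 8-bit width, the linear mapping, and the names `quantize_8bit` and `dequantize_8bit` are assumptions made for this example; the summary does not state which quantization scheme the paper uses.

```python
def quantize_8bit(weights):
    """Linearly map floats to integer codes in [0, 255].

    Returns (codes, w_min, scale). Hypothetical helper, not the paper's code.
    """
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant vector, where the range would be zero.
    scale = (w_max - w_min) / 255.0 or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, w_min, scale


def dequantize_8bit(codes, w_min, scale):
    """Recover approximate float weights from the integer codes."""
    return [w_min + c * scale for c in codes]


weights = [-0.82, -0.10, 0.0, 0.37, 0.91]
codes, w_min, scale = quantize_8bit(weights)
restored = dequantize_8bit(codes, w_min, scale)

# Rounding to the nearest code bounds the error by about half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing one byte per weight instead of a 32-bit float cuts the stored model size roughly fourfold, at the cost of a reconstruction error of at most about `scale / 2` per weight.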
Findings
Optimizations enable real-time LSTM-RNN synthesis on mobile devices.
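The optimized LSTM-RNN-based system is comparable to HMM-based SPSS in runtime speed while maintaining its naturalness.
Earlier LSTM-RNN-based SPSS had already shown significant improvements in naturalness and latency over HMM-based SPSS; the paper also evaluates it against HMM-driven unit selection synthesis.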
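The runtime saving from multi-frame inference can be sketched with a toy loop: if the acoustic model emits several frames per forward pass, it needs fewer passes for the same utterance. Everything here is an assumption for illustration, including the function names and the stand-in `toy_predict_step`, which is not a real acoustic model.

```python
def synthesize(num_frames, frames_per_step, predict_step):
    """Call predict_step once per block of frames_per_step frames.

    Returns the generated frames and the number of model calls made.
    """
    frames = []
    steps = 0
    for start in range(0, num_frames, frames_per_step):
        frames.extend(predict_step(start, frames_per_step))
        steps += 1
    # The last block may overshoot the utterance, so trim it.
    return frames[:num_frames], steps


def toy_predict_step(start, k):
    # Hypothetical stand-in for the network: one scalar "feature" per frame.
    return [float(start + i) for i in range(k)]


single, single_steps = synthesize(10, 1, toy_predict_step)
multi, multi_steps = synthesize(10, 4, toy_predict_step)
```

With 10 frames, predicting 4 frames per call needs 3 model calls instead of 10. In a real system the output would differ slightly between the two settings; the toy model returns identical frames only because it is deterministic in the frame index.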
Abstract
Acoustic models based on long short-term memory recurrent neural networks (LSTM-RNNs) were applied to statistical parametric speech synthesis (SPSS) and showed significant improvements in naturalness and latency over those based on hidden Markov models (HMMs). This paper describes further optimizations of LSTM-RNN-based SPSS for deployment on mobile devices: weight quantization, multi-frame inference, and robust inference using an ε-contaminated Gaussian loss function. Experimental results in subjective listening tests show that these optimizations can make LSTM-RNN-based SPSS comparable to HMM-based SPSS in runtime speed while maintaining naturalness. Evaluations between LSTM-RNN-based SPSS and HMM-driven unit selection speech synthesis are also presented.
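To show why an ε-contaminated Gaussian loss is robust, the sketch below compares its negative log-likelihood with that of a plain Gaussian on an outlier. It assumes one common form of the contaminated density, a mixture of two Gaussians with the same mean in which a fraction ε of the mass has its variance scaled by a factor c. The values `eps=0.1` and `c=10.0` and the function names are illustrative choices, not taken from the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gaussian_nll(x, mean, var):
    """Negative log-likelihood under a plain Gaussian."""
    return -math.log(gaussian_pdf(x, mean, var))


def contaminated_nll(x, mean, var, eps=0.1, c=10.0):
    """NLL under (1 - eps) * N(mean, var) + eps * N(mean, c * var).

    Assumed form of the epsilon-contaminated Gaussian; eps and c are
    illustrative values.
    """
    p = (1.0 - eps) * gaussian_pdf(x, mean, var) + eps * gaussian_pdf(x, mean, c * var)
    return -math.log(p)


# A point 8 standard deviations from the mean acts as an outlier.
outlier_plain = gaussian_nll(8.0, 0.0, 1.0)
outlier_robust = contaminated_nll(8.0, 0.0, 1.0)
```

The wide component keeps the penalty on far-away targets from growing quadratically, so a few badly aligned or noisy training frames pull on the model far less than they would under a squared-error (plain Gaussian) loss.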
