A Deep Learning Approach to Data-driven Parameterizations for Statistical Parametric Speech Synthesis

Prasanna Kumar Muthukumar; Alan W. Black

arXiv:1409.8558·cs.CL·October 1, 2014·5 cites

TL;DR

This paper introduces a data-driven, deep learning-based parameterization of the Mel Log Spectrum tailored for statistical parametric speech synthesis, aiming to improve synthesis quality over traditional Mel Cepstral coefficients.

Contribution

The paper proposes an invertible, low-dimensional encoding of the Mel Log Spectrum. A tapered Stacked Denoising Autoencoder (SDA) is trained first, then unwrapped to initialize a Multi-Layer Perceptron (MLP) that is fine-tuned to reconstruct its input.

Findings

01. Improved speech synthesis quality with the new parameterization
02. Robustness to noise in the encoding process
03. Better fulfillment of synthesis requirements compared to traditional methods

Abstract

Nearly all Statistical Parametric Speech Synthesizers today use Mel Cepstral coefficients as the vocal tract parameterization of the speech signal. Mel Cepstral coefficients were never intended to work in a parametric speech synthesis framework, but as yet, there has been little success in creating a better parameterization that is more suited to synthesis. In this paper, we use deep learning algorithms to investigate a data-driven parameterization technique that is designed for the specific requirements of synthesis. We create an invertible, low-dimensional, noise-robust encoding of the Mel Log Spectrum by training a tapered Stacked Denoising Autoencoder (SDA). This SDA is then unwrapped and used as the initialization for a Multi-Layer Perceptron (MLP). The MLP is fine-tuned by training it to reconstruct the input at the output layer. This MLP is then split down the middle to form…
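The pipeline the abstract describes (layer-wise denoising pretraining of a tapered stack, unwrapping into an MLP, fine-tuning on reconstruction, then splitting at the bottleneck) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the layer sizes `[40, 20, 8]`, sigmoid encoders with linear decoders, Gaussian corruption, plain full-batch gradient descent, the helper names `forward`, `gd_step`, and `pretrain_sda`, and the synthetic data standing in for Mel Log Spectrum frames. Because the abstract is truncated at the splitting step, calling the two halves `encoder` and `decoder` is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(layers, x):
    """Run x through a list of (W, b, use_sigmoid) layers; return every activation."""
    acts = [x]
    for W, b, use_sigmoid in layers:
        z = acts[-1] @ W + b
        acts.append(sigmoid(z) if use_sigmoid else z)
    return acts

def gd_step(layers, x_in, target, lr):
    """One full-batch gradient step on squared error; updates weights in place."""
    acts = forward(layers, x_in)
    loss = float(np.mean((acts[-1] - target) ** 2))
    delta = (acts[-1] - target) / len(x_in)
    for i in reversed(range(len(layers))):
        W, b, use_sigmoid = layers[i]
        if use_sigmoid:
            delta = delta * acts[i + 1] * (1.0 - acts[i + 1])
        grad_W, grad_b = acts[i].T @ delta, delta.sum(axis=0)
        delta = delta @ W.T  # back-propagate before W is modified
        W -= lr * grad_W
        b -= lr * grad_b
    return loss  # loss measured before this step's update

def pretrain_sda(data, sizes, noise_std=0.05, lr=0.05, epochs=2000):
    """Greedy layer-wise denoising pretraining over tapering layer sizes."""
    enc, dec, x = [], [], data
    for n_in, n_hid in zip(sizes[:-1], sizes[1:]):
        pair = [
            (rng.normal(0.0, 0.1, (n_in, n_hid)), np.zeros(n_hid), True),
            (rng.normal(0.0, 0.1, (n_hid, n_in)), np.zeros(n_in), False),
        ]
        for _ in range(epochs):
            # Corrupt the input, but ask the layer to reconstruct the clean version.
            noisy = x + rng.normal(0.0, noise_std, x.shape)
            gd_step(pair, noisy, x, lr)
        enc.append(pair[0])
        dec.insert(0, pair[1])  # decoders are stored in reverse (mirror) order
        x = forward(pair[:1], x)[-1]  # clean codes become the next layer's data
    return enc, dec

# Synthetic low-rank data standing in for 40-dimensional spectral frames (assumption).
latent = rng.normal(size=(200, 4))
data = 0.15 * latent @ rng.normal(size=(4, 40))

sizes = [40, 20, 8]  # tapered: each layer is narrower than the last
enc, dec = pretrain_sda(data, sizes)

# "Unwrap" the stack into one MLP and fine-tune it to reproduce its own input.
mlp = enc + dec
losses = [gd_step(mlp, data, data, lr=0.05) for _ in range(2000)]

# Split at the bottleneck: first half maps frames to codes, second half maps back.
encoder, decoder = mlp[:len(enc)], mlp[len(enc):]
code = forward(encoder, data)[-1]
recon = forward(decoder, code)[-1]
final_loss = float(np.mean((recon - data) ** 2))
print(code.shape, losses[0], final_loss)
```

In this sketch the 8-dimensional `code` plays the role of the low-dimensional parameterization, and `decoder` is what makes the encoding approximately invertible. A real system would train on actual spectral frames and would likely use mini-batches and a tuned learning rate.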

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Denoising Autoencoder