Multi-task WaveNet: A Multi-task Generative Model for Statistical Parametric Speech Synthesis without Fundamental Frequency Conditions

Yu Gu; Yongguo Kang

arXiv:1806.08619·eess.AS·June 25, 2018·5 cites

TL;DR

This paper presents Multi-task WaveNet, an improved speech synthesis model that generates natural speech conditioned solely on linguistic features by integrating multi-task learning, eliminating the need for external pitch prediction.

Contribution

The paper introduces a multi-task learning framework for WaveNet that removes the external pitch predictor, enhancing speech naturalness and simplifying inference in statistical parametric speech synthesis.

Findings

01 Achieves better objective and subjective speech quality than the state-of-the-art approach based on the original WaveNet.

02 Addresses pitch prediction error accumulation in WaveNet.

03 Simplifies inference by removing the external pitch prediction model.

Abstract

This paper introduces an improved generative model for statistical parametric speech synthesis (SPSS) based on WaveNet under a multi-task learning framework. Unlike the original WaveNet model, the proposed Multi-task WaveNet uses frame-level acoustic feature prediction as a secondary task, so the external fundamental frequency prediction model required by the original WaveNet can be removed. The improved WaveNet can therefore generate high-quality speech waveforms conditioned only on linguistic features. Multi-task WaveNet can produce more natural and expressive speech by addressing the pitch prediction error accumulation issue, and its inference procedure is more succinct than that of the original WaveNet. Experimental results show that the SPSS method proposed in this paper can achieve better performance than the state-of-the-art approach utilizing the original WaveNet in both…
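The abstract does not spell out the training objective, so the following is only a minimal sketch of how a multi-task loss of this kind is commonly set up, not the authors' implementation. It assumes the primary task is a categorical cross-entropy over quantized waveform samples (as in the original WaveNet) and that the secondary task is a mean squared error on frame-level acoustic features. The function names and the `aux_weight` parameter are hypothetical and chosen purely for illustration.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy for one categorical prediction (e.g. one quantized sample)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean_squared_error(pred, target):
    """Mean squared error between two equal-length feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_task_loss(sample_logits, sample_targets,
                    acoustic_pred, acoustic_target, aux_weight=0.5):
    """Combine a primary waveform loss with a secondary acoustic-feature loss.

    sample_logits   : per-sample class scores over quantization levels
    sample_targets  : index of the true quantization level for each sample
    acoustic_pred   : predicted frame-level acoustic feature vectors
    acoustic_target : reference frame-level acoustic feature vectors
    aux_weight      : hypothetical weight of the secondary task (assumption)
    """
    primary = sum(
        softmax_cross_entropy(l, t)
        for l, t in zip(sample_logits, sample_targets)
    ) / len(sample_targets)
    secondary = sum(
        mean_squared_error(p, t)
        for p, t in zip(acoustic_pred, acoustic_target)
    ) / len(acoustic_target)
    return primary + aux_weight * secondary, primary, secondary


# Toy usage: two samples over 4 quantization levels, one 2-dimensional frame.
total, primary, secondary = multi_task_loss(
    sample_logits=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    sample_targets=[1, 3],
    acoustic_pred=[[1.0, 2.0]],
    acoustic_target=[[1.0, 2.0]],
)
```

In a setup like this, the secondary head would only supply a training signal; at synthesis time the network would be driven by linguistic features alone, which matches the abstract's claim that no external fundamental frequency predictor is needed.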

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing