Multi-task WaveNet: A Multi-task Generative Model for Statistical Parametric Speech Synthesis without Fundamental Frequency Conditions
Yu Gu, Yongguo Kang

TL;DR
This paper presents Multi-task WaveNet, an improved speech synthesis model that generates natural speech conditioned solely on linguistic features by integrating multi-task learning, eliminating the need for external pitch prediction.
Contribution
The paper introduces a multi-task learning framework for WaveNet that removes the external pitch predictor, enhancing speech naturalness and simplifying inference in statistical parametric speech synthesis.
Findings
Achieves better objective and subjective speech quality than the state-of-the-art approach based on the original WaveNet.
Addresses pitch prediction error accumulation in WaveNet.
Simplifies inference by removing the external pitch prediction model.
Abstract
This paper introduces an improved generative model for statistical parametric speech synthesis (SPSS) based on WaveNet under a multi-task learning framework. Unlike the original WaveNet model, the proposed Multi-task WaveNet uses frame-level acoustic feature prediction as a secondary task, so the external fundamental frequency prediction model required by the original WaveNet can be removed. The improved WaveNet can therefore generate high-quality speech waveforms conditioned only on linguistic features. Multi-task WaveNet produces more natural and expressive speech by addressing the accumulation of pitch prediction errors, and its inference procedure is more succinct than that of the original WaveNet. Experimental results show that the SPSS method proposed in this paper achieves better performance than the state-of-the-art approach utilizing the original WaveNet in both…
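The multi-task idea in the abstract can be illustrated with a minimal sketch of a combined training objective. The abstract does not give the loss functions, the weighting, or the feature set, so everything here is an assumption made for illustration: cross-entropy over quantized waveform samples as the primary task, mean squared error over frame-level acoustic features as the secondary task, and a hypothetical `aux_weight` that balances the two. It is not the authors' implementation.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for one categorical prediction (e.g. one quantized audio sample)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target_index]


def mean_squared_error(predicted, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def multi_task_loss(sample_logits, sample_targets, frame_pred, frame_target,
                    aux_weight=0.5):
    """Combine a waveform loss with an auxiliary acoustic-feature loss.

    aux_weight is a hypothetical hyperparameter; the abstract does not
    say how the two tasks are weighted.
    """
    # Primary task: average cross-entropy over the waveform samples.
    primary = sum(
        softmax_cross_entropy(logits, target)
        for logits, target in zip(sample_logits, sample_targets)
    ) / len(sample_targets)
    # Secondary task: error on frame-level acoustic features (e.g. pitch).
    secondary = mean_squared_error(frame_pred, frame_target)
    return primary + aux_weight * secondary, primary, secondary


# Toy usage: two samples over 4 quantization classes, one 2-dimensional frame.
total, primary, secondary = multi_task_loss(
    sample_logits=[[0.0] * 4, [0.0] * 4],
    sample_targets=[1, 3],
    frame_pred=[1.0, 2.0],
    frame_target=[1.0, 4.0],
)
```

In a setup like this, the secondary term is used only during training. At synthesis time the network would be driven by linguistic features alone, which matches the abstract's claim that no external fundamental frequency predictor is needed.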
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
